Generated: 2026-02-03 16:45:37 (UTC+8); arXiv release time: 2026-02-03 20:00 EST (2026-02-04 09:00 UTC+8)

134 related papers today

Keyword: reinforcement learning

AutoBool: An Reinforcement-Learning trained LLM for Effective Automated Boolean Query Generation for Systematic Reviews

Representation Learning Enhanced Deep Reinforcement Learning for Optimal Operation of Hydrogen-based Multi-Energy Systems

Asynchronous MultiAgent Reinforcement Learning for 5G Routing under Side Constraints

Distributional Reinforcement Learning for Condition-Based Maintenance of Multi-Pump Equipment

Joint Continual Learning of Local Language Models and Cloud Offloading Decisions with Budget Constraints

Learning Robust Reasoning through Guided Adversarial Self-Play

CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning

From Gameplay Traces to Game Mechanics: Causal Induction with Large Language Models

Sample Complexity Analysis for Constrained Bilevel Reinforcement Learning

AdaFuse: Adaptive Multimodal Fusion for Lung Cancer Risk Prediction via Reinforcement Learning

MASC: Metal-Aware Sampling and Correction via Reinforcement Learning for Accelerated MRI

ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models

KEPO: Knowledge-Enhanced Preference Optimization for Reinforcement Learning with Reasoning

ZEST: Zero-shot Embodied Skill Transfer for Athletic Robot Control

DROGO: Default Representation Objective via Graph Optimization in Reinforcement Learning

Variational Approach for Job Shop Scheduling

Open Materials Generation with Inference-Time Reinforcement Learning

LLMs as High-Dimensional Nonlinear Autoregressive Models with Attention: Training, Alignment and Inference

FedMOA: Federated GRPO for Personalized Reasoning LLMs under Heterogeneous Rewards

Search Inspired Exploration in Reinforcement Learning

AREAL-DTA: Dynamic Tree Attention for Efficient Reinforcement Learning of Large Language Models

Minerva: Reinforcement Learning with Verifiable Rewards for Cyber Threat Intelligence LLMs

How Far Are LLMs from Professional Poker Players? Revisiting Game-Theoretic Reasoning with Agentic Tool Use

Reinforcement Learning-assisted Constraint Relaxation for Constrained Expensive Optimization

Surrogate Ensemble in Expensive Multi-Objective Optimization via Deep Q-Learning

APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation

NetWorld: Communication-Based Diffusion World Model for Multi-Agent Reinforcement Learning in Wireless Networks

Learning to Decode Against Compositional Hallucination in Video Multimodal Large Language Models

Learning Modal-Mixed Chain-of-Thought Reasoning with Latent Embeddings

Agentic Reward Modeling: Verifying GUI Agent via Online Proactive Interaction

Safe Langevin Soft Actor Critic

Model-Based Data-Efficient and Robust Reinforcement Learning

Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation

Equilibrium of Feasible Zone and Uncertain Model in Safe Exploration

LegalOne: A Family of Foundation Models for Reliable Legal Reasoning

Toward Reliable Sim-to-Real Predictability for MoE-based Robust Quadrupedal Locomotion

SA-VLA: Spatially-Aware Flow-Matching for Vision-Language-Action Reinforcement Learning

ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation

Adaptive Ability Decomposing for Unlocking Large Reasoning Model Effective Reinforcement Learning

Communications-Incentivized Collaborative Reasoning in NetGPT through Agentic Reinforcement Learning

Fast Non-Episodic Finite-Horizon RL with K-Step Lookahead Thresholding

World Models as an Intermediary between Agents and the Real World

DVLA-RL: Dual-Level Vision-Language Alignment with Reinforcement Learning Gating for Few-Shot Learning

Resource-Efficient Reinforcement for Reasoning Large Language Models via Dynamic One-Shot Policy Refinement

Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis

Learning Abstractions for Hierarchical Planning in Program-Synthesis Agents

Unveiling the Cognitive Compass: Theory-of-Mind-Guided Multimodal Emotion Reasoning

DISPO: Enhancing Training Efficiency and Stability in Reinforcement Learning for Large Language Model Mathematical Reasoning

Reasoning and Tool-use Compete in Agentic RL: From Quantifying Interference to Disentangled Tuning

Reliable Use of Lemmas via Eligibility Reasoning and Section-Aware Reinforcement Learning

ESSAM: A Novel Competitive Evolution Strategies Approach to Reinforcement Learning for Memory Efficient LLMs Fine-Tuning

Discovering Process-Outcome Credit in Multi-Step LLM Reasoning

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

SetPO: Set-Level Policy Optimization for Diversity-Preserving LLM Reasoning

Probing RLVR training instability through the lens of objective-level hacking

Lyapunov Stability-Aware Stackelberg Game for Low-Altitude Economy: A Control-Oriented Pruning-Based DRL Approach

Parallel Training in Spiking Neural Networks

Self-Generative Adversarial Fine-Tuning for Large Language Models

PolicyFlow: Policy Optimization with Continuous Normalizing Flow in Reinforcement Learning

Med3D-R1: Incentivizing Clinical Reasoning in 3D Medical Vision-Language Models for Abnormality Diagnosis

ASTER: Agentic Scaling with Tool-integrated Extended Reasoning

Sample Efficient Active Algorithms for Offline Reinforcement Learning

Reinforcement Learning for Active Perception in Autonomous Navigation

Mixture-of-World Models: Scaling Multi-Task Reinforcement Learning with Modular Latent Dynamics

From Intents to Actions: Agentic AI in Autonomous Networks

AOASS: Adaptive Obstacle-Aware Square Spiral Framework for Single-mobile Anchor-Based WSN Localization

What Does Vision Tool-Use Reinforcement Learning Really Learn? Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom

Adaptive Quantum-Safe Cryptography for 6G Vehicular Networks via Context-Aware Optimization

CRAFT: Calibrated Reasoning with Answer-Faithful Traces via Reinforcement Learning for Multi-Hop Question Answering

PromptRL: Prompt Matters in RL for Flow-Based Image Generation

The Enhanced Physics-Informed Kolmogorov-Arnold Networks: Applications of Newton's Laws in Financial Deep Reinforcement Learning (RL) Algorithms

TQL: Scaling Q-Functions with Transformers by Preventing Attention Collapse

Provable Cooperative Multi-Agent Exploration for Reward-Free MDPs

ConPress: Learning Efficient Reasoning from Multi-Question Contextual Pressure

Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training

A Relative-Budget Theory for Reinforcement Learning with Verifiable Rewards in Large Language Model Reasoning

Making Bias Non-Predictive: Training Robust LLM Judges via Reinforcement Learning

MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety

Toward Cognitive Supersensing in Multimodal Large Language Model

AdNanny: One Reasoning LLM for All Offline Ads Recommendation Tasks

Know Your Step: Faster and Better Alignment for Flow Matching Models via Step-aware Advantages

The Multiple Ticket Hypothesis: Random Sparse Subnetworks Suffice for RLVR

Adaptive Rollout Allocation for Online Reinforcement Learning with Verifiable Rewards

Boosting Maximum Entropy Reinforcement Learning via One-Step Flow Matching

SUSD: Structured Unsupervised Skill Discovery through State Factorization

PISCES: Annotation-free Text-to-Video Post-Training via Optimal Transport-Aligned Rewards

Contribution-aware Token Compression for Efficient Video Understanding via Reinforcement Learning

FlowSteer: Interactive Agentic Workflow Orchestration via End-to-End Reinforcement Learning

TABX: A High-Throughput Sandbox Battle Simulator for Multi-Agent Reinforcement Learning

Scaling Search-Augmented LLM Reasoning via Adaptive Information Control

TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios

Semantic-aware Wasserstein Policy Regularization for Large Language Model Alignment

Restoring Exploration after Post-Training: Latent Exploration Decoding for Large Reasoning Models

Mitigating loss of control in advanced AI systems through instrumental goal trajectories

Beyond Mode Elicitation: Diversity-Preserving Reinforcement Learning via Latent Diffusion Reasoner

Uncertainty-Aware Non-Prehensile Manipulation with Mobile Manipulators under Object-Induced Occlusion

Adversarial Reward Auditing for Active Detection and Mitigation of Reward Hacking

Position: Beyond Model-Centric Prediction -- Agentic Time Series Forecasting

RFS: Reinforcement learning with Residual flow steering for dexterous manipulation

Grad2Reward: From Sparse Judgment to Dense Rewards for Improving Open-Ended LLM Reasoning

Beyond Precision: Training-Inference Mismatch is an Optimization Problem and Simple LR Scheduling Fixes It

Designing Time Series Experiments in A/B Testing with Transformer Reinforcement Learning

PretrainRL: Alleviating Factuality Hallucination of Large Language Models at the Beginning

VLM-Guided Experience Replay

Zero-Shot Off-Policy Learning

Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

Bandwidth-Efficient Multi-Agent Communication through Information Bottleneck and Vector Quantization

FORLER: Federated Offline Reinforcement Learning with Q-Ensemble and Actor Rectification

Probabilistic Performance Guarantees for Multi-Task Reinforcement Learning

Think Dense, Not Long: Dynamic Decoupled Conditional Advantage for Efficient Reasoning

DCoPilot: Generative AI-Empowered Policy Adaptation for Dynamic Data Center Operations

Learning Generative Selection for Best-of-N

ECHO: Entropy-Confidence Hybrid Optimization for Test-Time Reinforcement Learning

D-CORE: Incentivizing Task Decomposition in Large Reasoning Models for Complex Tool Use

ECHO-2: A Large Scale Distributed Rollout Framework for Cost-efficient Reinforcement Learning

Online Fine-Tuning of Pretrained Controllers for Autonomous Driving via Real-Time Recurrent RL

Learning While Staying Curious: Entropy-Preserving Supervised Fine-Tuning via Adaptive Self-Distillation for Large Reasoning Models

Segment to Focus: Guiding Latent Action Models in the Presence of Distractors

Learning Markov Decision Processes under Fully Bandit Feedback

Kimi K2.5: Visual Agentic Intelligence

Choice-Model-Assisted Q-learning for Delayed-Feedback Revenue Management

Advancing General-Purpose Reasoning Models with Modular Gradient Surgery

Position: Explaining Behavioral Shifts in Large Language Models Requires a Comparative Approach

SWE-Universe: Scale Real-World Verifiable Environments to Millions

Proof-RM: A Scalable and Generalizable Reward Model for Math Proof

Unified Personalized Reward Model for Vision Generation

SLIME: Stabilized Likelihood Implicit Margin Enforcement for Preference Optimization

David vs. Goliath: Verifiable Agent-to-Agent Jailbreaking via Reinforcement Learning

World-Gymnast: Training Robots with Reinforcement Learning in a World Model

Conflict-Aware Client Selection for Multi-Server Federated Learning

TIC-VLA: A Think-in-Control Vision-Language-Action Model for Robot Navigation in Dynamic Environments

Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability

RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System

Keyword: diffusion policy

CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining
