Generated: 2026-04-21 17:45:34 (UTC+8); arXiv announcement time: 2026-04-21 20:00 EDT (2026-04-22 08:00 UTC+8)

80 relevant papers today

Keyword: reinforcement learning

CFMS: Towards Explainable and Fine-Grained Chinese Multimodal Sarcasm Detection Benchmark

Reciprocal Co-Training (RCT): Coupling Gradient-Based and Non-Differentiable Models via Reinforcement Learning

GraphRAG-Router: Learning Cost-Efficient Routing over GraphRAGs and LLMs with Reinforcement Learning

Fuzzy Encoding-Decoding to Improve Spiking Q-Learning Performance in Autonomous Driving

Sampling for Quality: Training-Free Reward-Guided LLM Decoding via Sequential Monte Carlo

Training Language Models for Bilateral Trade with Private Information

Positive-Only Drifting Policy Optimization

S-GRPO: Unified Post-Training for Large Vision-Language Models

Agentic AI for Education: A Unified Multi-Agent Framework for Personalized Learning and Institutional Intelligence

AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers

DARLING: Detection Augmented Reinforcement Learning with Non-Stationary Guarantees

Autonomous Vehicle Collision Avoidance With Racing Parameterized Deep Reinforcement Learning

Debate as Reward: A Multi-Agent Reward System for Scientific Ideation via RL Post-Training

Active World-Model with 4D-informed Retrieval for Exploration and Awareness

Privacy-Aware Machine Unlearning with SISA for Reinforcement Learning-Based Ransomware Detection

A Stackelberg Game Framework with Drainability Guardrails for Pricing and Scaling in Multi-Tenant GPU Cloud Platforms

AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems

Q-DeepSight: Incentivizing Thinking with Images for Image Quality Assessment and Refinement

GRAIL: Autonomous Concept Grounding for Neuro-Symbolic Reinforcement Learning

Incentivizing Parametric Knowledge via Reinforcement Learning with Verifiable Rewards for Cross-Cultural Entity Translation

EasyVideoR1: Easier RL for Video Understanding

Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning

Multi-stage Planning for Multi-target Surveillance using Aircrafts Equipped with Synthetic Aperture Radars Aware of Target Visibility

NaviFormer: A Deep Reinforcement Learning Transformer-like Model to Holistically Solve the Navigation Problem

MCPO: Mastery-Consolidated Policy Optimization for Large Reasoning Models

SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models

Small Model as Master Orchestrator: Learning Unified Agent-Tool Orchestration with Parallel Subtask Decomposition

Web-Gewu: A Browser-Based Interactive Playground for Robot Reinforcement Learning

Live LTL Progress Tracking: Towards Task-Based Exploration

Do LLM-derived graph priors improve multi-agent coordination?

Guardrails in Logit Space: Safety Token Regularization for LLM Alignment

Beyond "I Don't Know": Evaluating LLM Self-Awareness in Discriminating Data and Model Uncertainty

A Survey of Reinforcement Learning for Large Language Models under Data Scarcity: Challenges and Solutions

Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction

Learning Whole-Body Humanoid Locomotion via Motion Generation and Motion Tracking

AutoSearch: Adaptive Search Depth for Efficient Agentic RAG via Reinforcement Learning

RISC-V Functional Safety for Autonomous Automotive Systems: An Analytical Framework and Research Roadmap for ML-Assisted Certification

Think before Go: Hierarchical Reasoning for Image-goal Navigation

TrafficClaw: Generalizable Urban Traffic Control via Unified Physical Environment Modeling

Waking Up Blind: Cold-Start Optimization of Supervision-Free Agentic Trajectories for Grounded Visual Perception

RS-HyRe-R1: A Hybrid Reward Mechanism to Overcome Perceptual Inertia for Remote Sensing Images Understanding

PoliLegalLM: A Technical Report on a Large Language Model for Political and Legal Affairs

SVL: Goal-Conditioned Reinforcement Learning as Survival Learning

COSEARCH: Joint Training of Reasoning and Document Ranking via Reinforcement Learning for Agentic Search

Poly-EPO: Training Exploratory Reasoning Models

OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL

Tool Learning Needs Nothing More Than a Free 8B Language Model

Input-Side Variance Suppression under Non-Normal Transient Amplification in Continuous-Control Reinforcement Learning

Efficient Federated RLHF via Zeroth-Order Policy Optimization

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

Re$^2$MoGen: Open-Vocabulary Motion Generation via LLM Reasoning and Physics-Aware Refinement

DART: Learning-Enhanced Model Predictive Control for Dual-Arm Non-Prehensile Manipulation

LEPO: Latent Reasoning Policy Optimization for Large Language Models

Fisher Decorator: Refining Flow Policy via A Local Transport Map

HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

LiteResearcher: A Scalable Agentic RL Training Framework for Deep Research Agent

Modeling Multiple Support Strategies within a Single Turn for Emotional Support Conversations

Neural Garbage Collection: Learning to Forget while Learning to Reason

SELF-EMO: Emotional Self-Evolution from Recognition to Consistent Expression

CodePivot: Bootstrapping Multilingual Transpilation in LLMs via Reinforcement Learning without Parallel Corpora

ConventionPlay: Capability-Limited Training for Robust Ad-Hoc Collaboration

Frugal Geofencing via Energy-aware Sensing and Reporting

Does "Do Differentiable Simulators Give Better Policy Gradients?" Give Better Policy Gradients?

QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning

Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models

AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

Training and Agentic Inference Strategies for LLM-based Manim Animation Generation

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

OpenGame: Open Agentic Coding for Games

StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

Knowing When to Quit: A Principled Framework for Dynamic Abstention in LLM Reasoning

Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks

UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning

When Can LLMs Learn to Reason with Weak Supervision?

Bounded Ratio Reinforcement Learning

Keyword: diffusion policy

No results.