Generated: 2025-10-22 16:33:10 (UTC+8); arXiv release time: 2025-10-22 20:00 EDT (2025-10-23 08:00 UTC+8)

46 relevant papers today

Keyword: reinforcement learning

Quantum-Driven State-Reduction for Reliable UAV Trajectory Optimization in Low-Altitude Networks

DRL-Based Resource Allocation for Energy-Efficient IRS-Assisted UAV Spectrum Sharing Systems

POPI: Personalizing LLMs via Optimized Natural Language Preference Inference

TritonRL: Training LLMs to Think and Code Triton Without Cheating

Self-Evidencing Through Hierarchical Gradient Decomposition: A Dissipative System That Maintains Non-Equilibrium Steady-State by Minimizing Variational Free Energy

CLAWS: Creativity detection for LLM-generated solutions using Attention Window of Sections

Rewarding the Journey, Not Just the Destination: A Composite Path and Answer Self-Scoring Reward Mechanism for Test-Time Reinforcement Learning

EvoSyn: Generalizable Evolutionary Data Synthesis for Verifiable Learning

UniRL-Zero: Reinforcement Learning on Unified Models with Joint Language Model and Diffusion Model Experts

Humanoid Goalkeeper: Learning from Position Conditioned Task-Motion Constraints

OPTAGENT: Optimizing Multi-Agent LLM Interactions Through Verbal Reinforcement Learning for Enhanced Reasoning

Adaptive Divergence Regularized Policy Optimization for Fine-tuning Generative Models

SPACeR: Self-Play Anchoring with Centralized Reference Models

R2L: Reliable Reinforcement Learning: Guaranteed Return & Reliable Policies in Reinforcement Learning

Provably Optimal Reinforcement Learning under Safety Filtering

RL-Driven Security-Aware Resource Allocation Framework for UAV-Assisted O-RAN

LLMs Encode How Difficult Problems Are

Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains

Nash Policy Gradient: A Policy Gradient Method with Iteratively Refined Regularization for Finding Nash Equilibria

NTKMTL: Mitigating Task Imbalance in Multi-Task Learning from Neural Tangent Kernel Perspective

From Competition to Synergy: Unlocking Reinforcement Learning for Subject-Driven Image Generation

Food4All: A Multi-Agent Framework for Real-time Free Food Discovery with Integrated Nutritional Metadata

Proactive Reasoning-with-Retrieval Framework for Medical Multimodal Large Language Models

Higher Embedding Dimension Creates a Stronger World Model for a Simple Sorting Task

Why Policy Gradient Algorithms Work for Undiscounted Total-Reward MDPs

PGTT: Phase-Guided Terrain Traversal for Perceptive Legged Locomotion

Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback

MENTOR: A Reinforcement Learning Framework for Model Enhancement via Teacher-Optimized Rewards in Small Models

On AI Verification in Open RAN

DeLoad: Demand-Driven Short-Video Preloading with Scalable Watch-Time Estimation

CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment

Safe But Not Sorry: Reducing Over-Conservatism in Safety Critics via Uncertainty-Aware Modulation

Socialized Learning and Emergent Behaviors in Multi-Agent Systems based on Multimodal Large Language Models

Efficient Model-Based Reinforcement Learning for Robot Control via Online Learning

Deep Q-Learning Assisted Bandwidth Reservation for Multi-Operator Time-Sensitive Vehicular Networking

Sherlock Your Queries: Learning to Ask the Right Questions for Dialogue-Based Retrieval

Reinforcement Learning with Imperfect Transition Predictions: A Bellman-Jensen Approach

Preference-based Reinforcement Learning beyond Pairwise Comparisons: Benefits of Multiple Options

Verifiable Accuracy and Abstention Rewards in Curriculum RL to Alleviate Lost-in-Conversation

WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection

Online SFT for LLM Reasoning: Surprising Effectiveness of Self-Tuning without Rewards

Search Self-play: Pushing the Frontier of Agent Capability without Supervision

Actor-Free Continuous Control via Structurally Maximizable Q-Functions

Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning

EffiReasonTrans: RL-Optimized Reasoning for Code Translation

Retaining by Doing: The Role of On-Policy Data in Mitigating Forgetting

Keyword: diffusion policy

No results