Generated: 2025-10-23 16:30:31 (UTC+8); arXiv release time: 2025-10-23 20:00 EDT (2025-10-24 08:00 UTC+8)

30 related papers today

Keyword: reinforcement learning

CosmoCore Affective Dream-Replay Reinforcement Learning for Code Generation

ADPO: Anchored Direct Preference Optimization

Noise-corrected GRPO: From Noisy Rewards to Unbiased Gradients

BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping

REPAIR Approach for Social-based City Reconstruction Planning in case of natural disasters

Rectifying Shortcut Behaviors in Preference-based Reward Learning

POLAR: Policy-based Layerwise Reinforcement Learning Method for Stealthy Backdoor Attacks in Federated Learning

A Communication-Efficient Decentralized Actor-Critic Algorithm

RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs

SPOT: Scalable Policy Optimization with Trees for Markov Decision Processes

Interpret Policies in Deep Reinforcement Learning using SILVER with RL-Guided Labeling: A Model-level Approach to High-dimensional and Multi-action Environments

Hierarchical DLO Routing with Reinforcement Learning and In-Context Vision-language Models

Managing Charging Induced Grid Stress and Battery Degradation in Electric Taxi Fleets

QiMeng-SALV: Signal-Aware Learning for Verilog Code Generation

Unified Reinforcement and Imitation Learning for Vision-Language Models

Continual Knowledge Adaptation for Reinforcement Learning

Balancing Rewards in Text Summarization: Multi-Objective Reinforcement Learning via HyperVolume Optimization

Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning

A Markov Decision Process for Variable Selection in Branch & Bound

Autobidding Arena: unified evaluation of the classical and RL-based autobidding algorithms

LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts

ColorAgent: Building A Robust, Personalized, and Interactive OS Agent

Reasoning Like Experts: Leveraging Multimodal Large Language Models for Drawing-based Psychoanalysis

Using Non-Expert Data to Robustify Imitation Learning via Offline Reinforcement Learning

The Confusing Instance Principle for Online Linear Quadratic Control

MedReason-R1: Learning to Reason for CT Diagnosis with Reinforcement Learning and Local Zoom

Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning

SEA: Semantic Map Prediction for Active Exploration of Uncertain Areas

Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning

olmOCR 2: Unit Test Rewards for Document OCR

Keyword: diffusion policy

No results found.