Generated: 2025-10-21 16:32:13 (UTC+8); arXiv release time: 2025-10-21 20:00 EDT (2025-10-22 08:00 UTC+8)

61 relevant articles today

Keyword: reinforcement learning

DiffPlace: A Conditional Diffusion Framework for Simultaneous VLSI Placement Beyond Sequential Paradigms

Cog-Rethinker: Hierarchical Metacognitive Reinforcement Learning for LLM Reasoning

Can GRPO Help LLMs Transcend Their Pretraining Origin?

Using Kolmogorov-Smirnov Distance for Measuring Distribution Shift in Machine Learning

Transfer learning strategies for accelerating reinforcement-learning-based flow control

Airfoil optimization using Design-by-Morphing with minimized design-space dimensionality

Feature-driven reinforcement learning for photovoltaic in continuous intraday trading

RoBCtrl: Attacking GNN-Based Social Bot Detectors via Reinforced Manipulation of Bots Control Interaction

PrivacyPAD: A Reinforcement Learning Framework for Dynamic Privacy-Aware Delegation

Zero-shot World Models via Search in Memory

A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies

Alignment is Localized: A Causal Probe into Preference Layers

The Formalism-Implementation Gap in Reinforcement Learning Research

Expressive Reward Synthesis with the Runtime Monitoring Language

Human-Allied Relational Reinforcement Learning

WEBSERV: A Browser-Server Environment for Efficient Training of Reinforcement Learning-based Web Agents at Scale

Distractor Injection Attacks on Large Reasoning Models: Characterization and Defense

RL makes MLLMs see better than SFT

Call-Center Staff Scheduling Considering Performance Evolution under Emotional Stress

SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning

RAVEN: Robust Advertisement Video Violation Temporal Grounding via Reinforcement Reasoning

Buzz, Choose, Forget: A Meta-Bandit Framework for Bee-Like Decision Making

NP-Engine: Empowering Optimization Reasoning in Large Language Models with Verifiable Synthetic NP Problems

LANPO: Bootstrapping Language and Numerical Feedback for Reinforcement Learning in LLMs

Urban-R1: Reinforced MLLMs Mitigate Geospatial Biases for Urban General Intelligence

Count Counts: Motivating Exploration in LLM Reasoning with Count-based Intrinsic Rewards

Beyond Pipelines: A Survey of the Paradigm Shift toward Model-Native Agentic AI

A Comprehensive Survey on Reinforcement Learning-based Agentic Search: Foundations, Roles, Optimizations, Evaluations, and Applications

A Control-Theoretic Approach to Dynamic Payment Routing for Success Rate Optimization

VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents

Towards Context-aware Reasoning-enhanced Generative Searching in E-commerce

Prompt-MII: Meta-Learning Instruction Induction for LLMs

A Comparative User Evaluation of XRL Explanations using Goal Identification

Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization

SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents

Hephaestus: Mixture Generative Modeling with Energy Guidance for Large-scale QoS Degradation

Video Reasoning without Training

The Ends Justify the Thoughts: RL-Induced Motivated Reasoning in LLMs

Consistent Zero-Shot Imitation with Contrastive Goal Inference

Continuous Q-Score Matching: Diffusion Guided Reinforcement Learning for Continuous-Time Control

Rethinking On-policy Optimization for Query Augmentation

GACO-CAD: Geometry-Augmented and Conciseness-Optimized CAD Model Generation from Single Image

D2C-HRHR: Discrete Actions with Double Distributional Critics for High-Risk-High-Return Tasks

Coinvisor: An RL-Enhanced Chatbot Agent for Interactive Cryptocurrency Investment Analysis

Multimodal Safety Is Asymmetric: Cross-Modal Exploits Unlock Black-Box MLLMs Jailbreaks

Optimizing Energy Management of Smart Grid using Reinforcement Learning aided by Surrogate models built using Physics-informed Neural Networks

TabR1: Taming GRPO for tabular reasoning LLMs

Inference of Deterministic Finite Automata via Q-Learning

Leveraging Group Relative Policy Optimization to Advance Large Language Models in Traditional Chinese Medicine

Agentic Reinforcement Learning for Search is Unsafe

OncoReason: Structuring Clinical Reasoning in LLMs for Robust and Interpretable Survival Prediction

An Empirical Study of Lagrangian Methods in Safe Reinforcement Learning

RESample: A Robust Data Augmentation Framework via Exploratory Sampling for Robotic Manipulation

CrossGuard: Safeguarding MLLMs against Joint-Modal Implicit Malicious Attacks

A Principle of Targeted Intervention for Multi-Agent Reinforcement Learning

QueST: Incentivizing LLMs to Generate Difficult Problems

Train for Truth, Keep the Skills: Binary Retrieval-Augmented Reward Mitigates Hallucinations

UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

SoftMimic: Learning Compliant Whole-body Control from Examples

Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains

Keyword: diffusion policy

Continuous Q-Score Matching: Diffusion Guided Reinforcement Learning for Continuous-Time Control
