Generated: 2026-01-12 16:36:35 (UTC+8); arXiv release time: 2026-01-12 20:00 EST (2026-01-13 09:00 UTC+8)

24 related papers today

Keyword: reinforcement learning

KP-Agent: Keyword Pruning in Sponsored Search Advertising via LLM-Powered Contextual Bandits

On the Limits of Self-Improving in LLMs and Why AGI, ASI and the Singularity Are Not Near Without Symbolic Model Synthesis

Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization

Do LLMs Need Inherent Reasoning Before Reinforcement Learning? A Study in Korean Self-Correction

PRISMA: Reinforcement Learning Guided Two-Stage Policy Optimization in Multi-Agent Architecture for Open-Domain Multi-Hop Question Answering

MaxCode: A Max-Reward Reinforcement Learning Framework for Automated Code Optimization

MemBuilder: Reinforcing LLMs for Long-Term Memory Construction via Attributed Dense Rewards

How Exploration Breaks Cooperation in Shared-Policy Multi-Agent Reinforcement Learning

LEAPS: An LLM-Empowered Adaptive Plugin for Taobao AI Search

WildSci: Advancing Scientific Reasoning from In-the-Wild Literature

Reinforcement Learning of Large Language Models for Interpretable Credit Card Fraud Detection

PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning

Orchestrating Tokens and Sequences: Dynamic Hybrid Policy Optimization for RLVR

Dual-Phase LLM Reasoning: Self-Evolved Mathematical Frameworks

GIFT: Games as Informal Training for Generalizable LLMs

SketchVL: Policy Optimization via Fine-Grained Credit Assignment for Chart Understanding and More

From Off-Policy to On-Policy: Enhancing GUI Agents via Bi-level Expert-to-Policy Assimilation

EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis

Intelligent Singularity Avoidance in UR10 Robotic Arm Path Planning Using Hybrid Fuzzy Logic and Reinforcement Learning

IIB-LPO: Latent Policy Optimization via Iterative Information Bottleneck

StackPlanner: A Centralized Hierarchical Multi-Agent System with Task-Experience Memory Management

TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents

Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards

Keyword: diffusion policy

CHDP: Cooperative Hybrid Diffusion Policies for Reinforcement Learning in Parameterized Action Space
