Generated: 2025-10-09 16:28:55 (UTC+8); arXiv release time: 2025-10-09 20:00 EDT (2025-10-10 08:00 UTC+8)

39 relevant papers today

Keyword: reinforcement learning

MCCE: A Framework for Multi-LLM Collaborative Co-Evolution

General and Efficient Visual Goal-Conditioned Reinforcement Learning using Object-Agnostic Masks

Monte Carlo Permutation Search

Attention-Enhanced Reinforcement Learning for Dynamic Portfolio Optimization

Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels

Beneficial Reasoning Behaviors in Agentic Search and Effective Post-training to Obtain Them

Scalable Policy-Based RL Algorithms for POMDPs

Incoherence in goal-conditioned autoregressive models

The Markovian Thinker

Aligning Large Language Models via Fully Self-Synthetic Data

PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch

XRPO: Pushing the limits of GRPO with Targeted Exploration and Exploitation

REACH: Reinforcement Learning for Adaptive Microservice Rescheduling in the Cloud-Edge Continuum

RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training

Dual Goal Representations

Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management

AWM: Accurate Weight-Matrix Fingerprint for Large Language Models

Verifying Memoryless Sequential Decision-making of Large Language Models

TTRV: Test-Time Reinforcement Learning for Vision Language Models

$λ$-GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences

SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models

Multi-Dimensional Autoscaling of Stream Processing Services on Edge Devices

Falsification-Driven Reinforcement Learning for Maritime Motion Planning

No MoCap Needed: Post-Training Motion Diffusion Models with Reinforcement Learning using Only Textual Prompts

Diffusing Trajectory Optimization Problems for Recovery During Multi-Finger Manipulation

Tool-Augmented Policy Optimization: Synergizing Reasoning and Adaptive Tool Use with Reinforcement Learning

Search-R3: Unifying Reasoning and Embedding Generation in Large Language Models

Sampling Strategies for Robust Universal Quadrupedal Locomotion Policies

The Contingencies of Physical Embodiment Allow for Open-Endedness and Care

DPL: Depth-only Perceptive Humanoid Locomotion via Realistic Depth Synthesis and Cross-Attention Terrain Reconstruction

Reasoning for Hierarchical Text Classification: The Case of Patents

HyPlan: Hybrid Learning-Assisted Planning Under Uncertainty for Safe Autonomous Driving

Customer-R1: Personalized Simulation of Human Behaviors via RL-based LLM Agent in Online Shopping

Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts

Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense

Test-Time Graph Search for Goal-Conditioned Reinforcement Learning

Online Rubrics Elicitation from Pairwise Comparisons

Evolutionary Profiles for Protein Fitness Prediction

Keyword: diffusion policy

No MoCap Needed: Post-Training Motion Diffusion Models with Reinforcement Learning using Only Textual Prompts
