LLM

Publish Date Title Authors Homepage Code
2026-05-15 IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation Yuqi Wu et.al. 2605.16258v1 null
2026-05-15 Designing Datacenter Power Delivery Hierarchies for the AI Era Grant Wilkins et.al. 2605.16255v1 null
2026-05-15 A Generative AI Framework for Intelligent Utility Billing CO2 Analytics and Sustainable Resource Optimisation Pavan Manjunath et.al. 2605.16250v1 null
2026-05-15 AI-Mediated Communication Can Steer Collective Opinion Stratis Tsirtsis et.al. 2605.16245v1 null
2026-05-15 Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation Jin Shi et.al. 2605.16241v1 null
2026-05-15 Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search Sarah Martinson et.al. 2605.16238v1 null
2026-05-15 FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast Igor Bogdanov et.al. 2605.16233v1 null
2026-05-15 Evaluating Design Video Generation: Metrics for Compositional Fidelity Adrienne Deganutti et.al. 2605.16223v1 null
2026-05-15 Artificial Aphasias in Lesioned Language Models Nathan Roll et.al. 2605.16222v1 null
2026-05-15 Argus: Evidence Assembly for Scalable Deep Research Agents Zhen Zhang et.al. 2605.16217v1 null
2026-05-15 Fully Open Meditron: An Auditable Pipeline for Clinical LLMs Xavier Theimer-Lienhard et.al. 2605.16215v1 null
2026-05-15 Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most Tahreem Yasir et.al. 2605.16207v1 null
2026-05-15 Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP Igor Bogdanov et.al. 2605.16205v1 null
2026-05-15 Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems Parand A. Alamdari et.al. 2605.16198v1 null
2026-05-15 paper.json: A Coordination Convention for LLM-Agent-Actionable Papers Arquimedes Canedo et.al. 2605.16194v1 null
2026-05-15 Improving Cross-Cultural Survey Simulation with Calibrated Value Personas Axel Abels et.al. 2605.16193v1 null
2026-05-15 Optimized Three-Dimensional Photovoltaic Structures with LLM guided Tree Search Michael P. Brenner et.al. 2605.16191v1 null
2026-05-15 Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models Yishun Lu et.al. 2605.16165v1 null
2026-05-15 An Algebraic Exposition of the Theory of Dyadic Morality Kush R. Varshney et.al. 2605.16153v1 null
2026-05-15 Look Before You Leap: Autonomous Exploration for LLM Agents Ziang Ye et.al. 2605.16143v1 null
2026-05-15 Property-Guided LLM Program Synthesis for Planning Augusto B. Corrêa et.al. 2605.16142v1 null
2026-05-15 Surrogate Neural Architecture Codesign Package (SNAC-Pack) Jason Weitz et.al. 2605.16138v1 null
2026-05-15 Navigating Potholes with Geometry-Aware Sharpness Minimization Simon Dufort-Labbé et.al. 2605.16134v1 null
2026-05-15 Entropy Across the Bridge: Conditional-Marginal Discretization for Flow and Schrödinger Samplers Bruno Trentini et.al. 2605.16126v1 null
2026-05-15 SGR: A Stepwise Reasoning Framework for LLMs with External Subgraph Generation Xin Zhang et.al. 2605.16117v1 null
2026-05-15 DebiasRAG: A Tuning-Free Path to Fair Generation in Large Language Models through Retrieval-Augmented Generation Rui Chu et.al. 2605.16113v1 null
2026-05-15 Multi-Level Contextual Token Relation Modeling for Machine-Generated Text Detection Chenwang Wu et.al. 2605.16107v1 null
2026-05-15 GeoGS-CE: Learning Delay--Beam Channel Priors with 3D Gaussians for High-Mobility Scenarios Yumeng Zhang et.al. 2605.16094v1 null
2026-05-15 Centralized vs Decentralized Federated Learning: A trade-off performance analysis Chaimaa Medjadji et.al. 2605.16089v1 null
2026-05-15 Towards Trustworthy and Explainable AI for Perception Models: From Concept to Prototype Vehicle Deployment Till Beemelmanns et.al. 2605.16087v1 null
2026-05-15 Towards Foundation Models for Relational Databases with Language Models and Graph Neural Networks Jingcheng Wu et.al. 2605.16085v1 null
2026-05-15 VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation Yiming Zhao et.al. 2605.16079v1 null
2026-05-15 Can Large Language Models Imitate Human Speech for Clinical Assessment? LLM-Driven Data Augmentation for Cognitive Score Prediction Si-Belkacem Yamine Ketir et.al. 2605.16077v1 null
2026-05-15 AgriMind: An Ensemble Deep Learning Framework for Multi-Class Plant Disease Classification Salma Hoque Talukdar Koli et.al. 2605.16076v1 null
2026-05-15 Robust Prior-Guided Segmentation for Editable 3D Gaussian Splatting Raushan Joshi et.al. 2605.16065v1 null
2026-05-15 Misspecified Explore-then-Exploit Leads to Supra-Competitive Prices Jackie Baek et.al. 2605.16064v1 null
2026-05-15 Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making Fan Feng et.al. 2605.16054v1 null
2026-05-15 Reasoners or Translators? Contamination-aware Evaluation and Neuro-Symbolic Robustness in Tax Law Parisa Kordjamshidi et.al. 2605.16052v1 null
2026-05-15 Looped SSMs: Depth-Recurrence and Input Reshaping for Time Series Classification Mónika Farsang et.al. 2605.16048v1 null
2026-05-15 XSearch: Explainable Code Search via Concept-to-Code Alignment Yiming Liu et.al. 2605.16046v1 null
2026-05-15 RecMem: Recurrence-based Memory Consolidation for Efficient and Effective Long-Running LLM Agents Zijie Dai et.al. 2605.16045v1 null
2026-05-15 Who Owns This Agent? Tracing AI Agents Back to Their Owners Ruben Chocron et.al. 2605.16035v1 null
2026-05-15 From Flat Language Labels to Typological Priors: Structured Language Conditioning for Multilingual Speech-to-Speech Translation Yu Pan et.al. 2605.16026v1 null
2026-05-15 Judge Circuits Nils Feldhus et.al. 2605.16023v1 null
2026-05-15 Can Vision Language Models Be Adaptive in Mathematics Education? A Learner Model-based Rubric Study Jie Gao et.al. 2605.16011v1 null
2026-05-15 CitePrism: Human-in-the-Loop AI for Citation Auditing and Editorial Integrity Gowrika Mahesh et.al. 2605.16000v1 null
2026-05-15 Constrained latent state modeling: A unifying perspective on representation learning under competing constraints Gwenolé Quellec et.al. 2605.15995v1 null
2026-05-15 Defining Cultural Capabilities for AI Evaluation: A Taxonomy Grounded in Intercultural Communication Theory Isar Nejadgholi et.al. 2605.15990v1 null
2026-05-15 Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues Zhongjie Ba et.al. 2605.15984v1 null
2026-05-15 Ontology for Policing: Conceptual Knowledge Learning for Semantic Understanding and Reasoning in Law Enforcement Reports Anita Srbinovska et.al. 2605.15978v1 null
2026-05-15 Reference-Free Reinforcement Learning Fine-Tuning for MT: A Seq2Seq Perspective Ernesto Garcia-Estrada et.al. 2605.15976v1 null
2026-05-15 Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning Dillon Z. Chen et.al. 2605.15975v1 null
2026-05-15 Deterministic Event-Graph Substrates as World Models for Counterfactual Reasoning Fabio Rovai et.al. 2605.15967v1 null
2026-05-15 PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control Jingxuan Wei et.al. 2605.15963v1 null
2026-05-15 Imperfect World Models are Exploitable Logan Mondal Bhamidipaty et.al. 2605.15960v1 null
2026-05-15 When and Why Adversarial Training Improves PINNs: A Neural Tangent Kernel Perspective Yuan-dong Cao et.al. 2605.15959v1 null
2026-05-15 Decomposed Vision-Language Alignment for Fine-Grained Open-Vocabulary Segmentation Chenhao Wang et.al. 2605.15942v1 null
2026-05-15 LoCO: Low-rank Compositional Rotation Fine-tuning An Nguyen et.al. 2605.15916v1 null
2026-05-15 SLIP & ETHICS: Graduated Intervention for AI Emotional Companions Minseo Kim et.al. 2605.15915v1 null
2026-05-15 Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation Shuaiyi Li et.al. 2605.15913v1 null
2026-05-15 RaPD: Resolution-Agnostic Pixel Diffusion via Semantics-Enriched Implicit Representations Yanhao Ge et.al. 2605.15908v1 null
2026-05-15 Generative Long-term User Interest Modeling for Click-Through Rate Prediction Jiangli Shao et.al. 2605.15905v1 null
2026-05-15 Uncertainty-Aware Wildfire Smoke Density Classification from Satellite Imagery via CBAM-Augmented EfficientNet with Evidential Deep Learning Ranjith Chodavarapu et.al. 2605.15894v1 null
2026-05-15 CHoE: Cross-Domain Heterogeneous Graph Prompt Learning via Structure-Conditioned Experts Peiyuan Li et.al. 2605.15888v1 null
2026-05-15 Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches Daria Blinova et.al. 2605.15886v1 null
2026-05-15 Symplectic Neural Operators for Learning Infinite Dimensional Hamiltonian Systems Yeang Makara et.al. 2605.15881v1 null
2026-05-15 Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design Alberto Pepe et.al. 2605.15871v1 null
2026-05-15 Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination Chufan Shi et.al. 2605.15864v1 null
2026-05-15 Conversations in Space: Structuring Non-Linear LLM Interactions on a Canvas Rifat Mehreen Amin et.al. 2605.15848v1 null
2026-05-15 RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades Xinbo Xu et.al. 2605.15846v1 null
2026-05-15 GAP: Geometric Anchor Pre-training for Data-Efficient Visuomotor Learning of Manipulation Tasks Davide Buoso et.al. 2605.15836v1 null
2026-05-15 Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation Yuqing Cheng et.al. 2605.15831v1 null
2026-05-15 Toward Natural and Companionable Virtual Agents via Cross-Temporal Emotional Modeling Feier Qin et.al. 2605.15812v1 null
2026-05-15 ForMaT: Dataset for Visually-Grounded Multilingual PDF Translation Michał Ciesiółka et.al. 2605.15794v1 null
2026-05-15 Grokking as Structural Inference: Transformers Need Bayesian Lottery Tickets Kai Hidajat et.al. 2605.15787v1 null
2026-05-15 SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows? Kean Shi et.al. 2605.15777v1 null
2026-05-15 ALSO: Adversarial Online Strategy Optimization for Social Agents Xiang Li et.al. 2605.15768v1 null
2026-05-15 GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions Junho Kim et.al. 2605.15764v1 null
2026-05-15 CompactQE: Interpretable Translation Quality Estimation via Small Open-Weight LLMs Kamil Guttmann et.al. 2605.15763v1 null
2026-05-15 DimMem: Dimensional Structuring for Efficient Long-Term Agent Memory Wentao Qiu et.al. 2605.15759v1 null
2026-05-15 BiomedAP: A Vision-Informed Dual-Anchor Framework with Gated Cross-Modal Fusion for Robust Medical Vision-Language Adaptation Huanyang Tong et.al. 2605.15736v1 null
2026-05-15 UAM: A Dual-Stream Perspective on Forgetting in VLA Training Jianke Zhang et.al. 2605.15735v1 null
2026-05-15 Can We Trust AI-Inferred User States. A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments Izabella Krzeminska et.al. 2605.15734v1 null
2026-05-15 Structure Abstraction and Generalization in a Hippocampal-Entorhinal Inspired World Model Tianqiu Zhang et.al. 2605.15733v1 null
2026-05-15 DecomPose: Disentangling Cross-Category Optimization Contention for Category-Level 6D Object Pose Estimation Yifan Gao et.al. 2605.15728v1 null
2026-05-15 Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR Chanuk Lee et.al. 2605.15726v1 null
2026-05-15 DiLA: Disentangled Latent Action World Models Tianqiu Zhang et.al. 2605.15725v1 null
2026-05-15 Bidirectional Fusion Guided by Cardiac Patterns for Semi-Supervised ECG Segmentation Jeonghwa Lim et.al. 2605.15722v1 null
2026-05-15 Contexting as Recommendation: Evolutionary Collaborative Filtering for Context Engineering Jiachen Zhu et.al. 2605.15721v1 null
2026-05-15 Position: Early-Stage Quality Assurance in Annotation Pipelines Is More Cost-Effective Than Late-Stage Validation Sunil Kothari et.al. 2605.15714v1 null
2026-05-15 Feedback World Model Enables Precise Guidance of Diffusion Policy Tuo An et.al. 2605.15705v1 null
2026-05-15 H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure Jiawei Yu et.al. 2605.15701v1 null
2026-05-15 ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models Jiahui Guang et.al. 2605.15687v1 null
2026-05-15 Few-Shot Large Language Models for Actionable Triage Categorization of Online Patient Inquiries Liqi Zhou et.al. 2605.15680v1 null
2026-05-15 VCG-Bench: Towards A Unified Visual-Centric Benchmark for Structured Generation and Editing Xiaoyan Su et.al. 2605.15677v1 null
2026-05-15 Dynamic Chunking for Diffusion Language Models Yichen Zhu et.al. 2605.15676v1 null
2026-05-15 Interaction-Aware Influence Functions for Group Attribution Jaeseung Heo et.al. 2605.15675v1 null
2026-05-15 VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following Hyesoo Hong et.al. 2605.15672v1 null
2026-05-15 PRISM: Prompt Reliability via Iterative Simulation and Monitoring for Enterprise Conversational AI Keshava Chaitanya et.al. 2605.15665v1 null
2026-05-15 VAGS: Velocity Adaptive Guidance Scale for Image Editing and Generation Yan Luo et.al. 2605.15661v1 null

Abstracts

IVGT: Implicit Visual Geometry Transformer for Neural Scene Representation

2605.16258v1 by Yuqi Wu, Tianyu Hu, Wenzhao Zheng, Yuanhui Huang, Haowen Sun, Jie Zhou, Jiwen Lu

Reconstructing coherent 3D geometry and appearance from unposed multi-view images is a fundamental yet challenging problem in computer vision. Most existing visual geometry foundation models predict explicit geometry by regressing pixel-aligned pointmaps, often suffering from redundancy and limited geometric continuity. We propose IVGT, an Implicit Visual Geometry Transformer that implicitly models continuous and coherent geometry from pose-free multi-view images. This formulation learns a continuous neural scene representation in a canonical coordinate system and supports continuous spatial queries at any 3D positions, retrieving local features to predict signed distance (SDF) values and colors using lightweight decoders. It allows direct extraction of continuous and coherent surface geometry, enabling rendering of RGB images, depth maps, and surface normal maps from arbitrary viewpoints. We train IVGT via multi-dataset joint optimization with 2D supervision and 3D geometric regularization. IVGT demonstrates generalization across scenes and achieves strong performance on various tasks, including mesh and point cloud reconstruction, novel view synthesis, depth and surface normal estimation, and camera pose estimation.

Designing Datacenter Power Delivery Hierarchies for the AI Era

2605.16255v1 by Grant Wilkins, Fiodar Kazhamiaka, Alok Gautam Kumbhare, Chaojie Zhang, Ricardo Bianchini

Demand for AI accelerators is rapidly increasing rack power density, with projections approaching 1MW per deployment by 2027. This poses a major challenge for datacenter power delivery designers. As power densities increase, a datacenter designed for a different target density may strand power, i.e., may be unable to use all the power that its delivery hierarchy has provisioned. Designs must remain efficient over long datacenter lifetimes and multiple hardware generations. Power utilization is particularly important as grid power capacity is a scarce resource in the AI era. Designing an efficient power delivery hierarchy for the long run is difficult because rack placement feasibility, workload impact, and cost depend jointly on electrical topology, deployment granularity, placement policy, power oversubscription, and workload mix. Moreover, each of these factors evolves over time, has inter-dependencies across multiple resource dimensions, and generally does not lend itself to closed-form analysis. To address this challenge, we develop a framework for evaluating datacenter power delivery designs using throughput, power, and cost metrics over realistic arrival, oversubscription, and decommissioning sequences. The framework combines projection models for GPU, compute, and storage deployments with operational factors grounded in production data from Microsoft Azure. Our results show that multi-resource stranding materially changes deployable capacity, effective capital expenditure, and delivered performance, and quantify how rising density from rack- and pod-scale AI systems shapes these outcomes. For AI datacenter design, the relevant planning objective is not installed megawatts, but deployable capacity over time.
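
The notion of stranded power in the abstract above can be illustrated with a toy calculation. The sketch below is not the authors' framework: the first-fit placement policy, the function names, and all capacities are hypothetical, and it tracks only one resource dimension (power). It shows how the same provisioned power can be fully usable at one rack density and partly stranded at another.

```python
def place_racks(feed_capacities_kw, rack_demands_kw):
    """First-fit placement (an assumed policy, chosen for simplicity).

    Returns the leftover capacity on each feed and the racks that fit nowhere.
    """
    remaining = list(feed_capacities_kw)
    unplaced = []
    for demand in rack_demands_kw:
        for i, free in enumerate(remaining):
            if demand <= free:
                remaining[i] -= demand
                break
        else:
            # No single feed has enough headroom for this rack.
            unplaced.append(demand)
    return remaining, unplaced


def stranded_fraction(feed_capacities_kw, rack_demands_kw):
    """Share of provisioned power left unused after placement."""
    remaining, _ = place_racks(feed_capacities_kw, rack_demands_kw)
    return sum(remaining) / sum(feed_capacities_kw)


# Hypothetical hall: three feeds of 400 kW each (1200 kW provisioned).
feeds = [400, 400, 400]
low_density = [40] * 30    # 30 racks of 40 kW, 1200 kW requested
high_density = [150] * 8   # 8 racks of 150 kW, 1200 kW requested

print(stranded_fraction(feeds, low_density))
print(stranded_fraction(feeds, high_density))
```

In this toy setup the 40 kW racks pack each feed exactly, while each feed can host only two 150 kW racks, leaving 100 kW per feed that no further rack can use.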

A Generative AI Framework for Intelligent Utility Billing CO2 Analytics and Sustainable Resource Optimisation

2605.16250v1 by Pavan Manjunath, Thomas Pruefer

Distribution utilities are now expected to deliver bills that customers can actually read, attach a defensible carbon number to every kWh sold, and schedule load against grid stress and emissions constraints. We propose an end-to-end framework that unifies four production-grade capabilities under one architectural roof: a generative-AI agent that drafts each customer's natural-language billing statement from structured numeric inputs under a constrained decoding policy; a transformer-based forecaster that supplies the day-ahead consumption estimate with calibrated quantile bands

AI-Mediated Communication Can Steer Collective Opinion

2605.16245v1 by Stratis Tsirtsis, Kai Rawal, Chris Russell, Brent Mittelstadt, Sandra Wachter

Generative artificial intelligence (AI) is increasingly integrated into the online platforms where humans exchange opinions; large language models (LLMs) now polish users' posts on LinkedIn and provide context for content shared on X. While prior work has shown that AI can express biased opinions and shape individuals' opinions during human-AI interactions, less attention has been paid to its influence on collective opinion formation when mediating human-to-human communication. We address this gap via a combination of empirical and theoretical analyses. We show empirically that LLMs from multiple popular families introduce directional biases when instructed to edit human-written texts on contested topics, for example, nudging texts in favor of gun control and against atheism. Building on this observation, we introduce a mathematical model of opinion dynamics in which an AI system sits between users on a social network, transforming the opinions they express and perceive. By analytically characterizing the equilibrium of this model and performing simulations on real social network data, we show that biases introduced by AI in human-to-human communication can be amplified through the network and shift collective opinion in their direction. In light of these findings, we investigate whether such biases are controllable by online platforms. We audit the "Explain this post" feature on X and find evidence of pro-life bias in Grok's outputs on abortion-related content, which we trace back to specific design choices. We conclude with a discussion of the broader implications of our findings in relation to ongoing legislative efforts in the European Union.
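
To make the amplification claim concrete, here is a minimal sketch of a generic opinion-dynamics model with an AI layer between users. It is not the paper's model: the update rule (a Friedkin-Johnsen-style average anchored to innate opinions), the constant additive `bias`, the `anchor` weight, and the four-user ring network are all assumptions chosen for illustration.

```python
def simulate(innate, neighbors, bias, anchor=0.2, steps=500):
    """Iterate a simple averaging model where an AI shifts every expressed opinion.

    innate:    each user's fixed private opinion
    neighbors: dict mapping user index -> list of neighbor indices
    bias:      constant shift the (hypothetical) AI adds to each expressed opinion
    anchor:    weight a user keeps on their own innate opinion
    """
    opinions = list(innate)
    for _ in range(steps):
        # The AI rewrites what each user expresses before neighbors perceive it.
        perceived = [o + bias for o in opinions]
        opinions = [
            anchor * innate[i]
            + (1 - anchor) * sum(perceived[j] for j in neighbors[i]) / len(neighbors[i])
            for i in range(len(innate))
        ]
    return opinions


def mean(values):
    return sum(values) / len(values)


# Hypothetical ring of four users with opposing innate opinions (mean 0).
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
innate = [-1.0, -0.5, 0.5, 1.0]

neutral = simulate(innate, neighbors, bias=0.0)
biased = simulate(innate, neighbors, bias=0.05)
print(mean(biased) - mean(neutral))
```

Under these assumptions the mean opinion settles at a shift of `(1 - anchor) * bias / anchor`, so a per-message bias of 0.05 moves the collective mean by about 0.2: repeated exchange through the network compounds a small edit-level bias.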

Offline Semantic Guidance for Efficient Vision-Language-Action Policy Distillation

2605.16241v1 by Jin Shi, Brady Zhang, Yishun Lu

Billion-parameter Vision-Language-Action (VLA) policies have recently shown impressive performance in robotic manipulation, yet their size and inference cost remain major obstacles for real-time closed-loop control. We introduce VLA-AD, a distillation framework that uses a Vision-Language Model as an offline semantic supervisor to transfer large VLA teachers into lightweight student policies. Instead of relying only on low-level action imitation, VLA-AD augments teacher-provided 7-DoF action targets with high-level semantic guidance, including task phase anchors and multi-frame operating-direction descriptions. These auxiliary signals are used only during training: at test time, the student policy runs independently, with neither the VLA teacher nor the VLM required. We evaluate VLA-AD on three LIBERO benchmark suites. Using OpenVLA-7B as the teacher, our method produces a 158M-parameter student, yielding a 44× reduction in model size while matching the teacher with only a 0.27% average relative gap. The resulting policy runs at 12.5 Hz on an RTX 4090, achieving a 3.28× inference speedup over OpenVLA-7B. We further show that the same semantic distillation pipeline generalizes to a different π0.5-4B teacher, where the student outperforms the teacher on two suites and remains within 0.53% on libero_goal. Additional analysis indicates that phase-level supervision and multi-frame directional cues make the student less sensitive to noisy teacher actions, such as erroneous high-frequency gripper changes. Overall, VLA-AD demonstrates that offline semantic guidance from VLMs can substantially improve the efficiency, robustness, and deployability of VLA policy distillation.
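
The headline ratios in this abstract can be sanity-checked with a few lines of arithmetic. The check below assumes the teacher has a nominal 7 billion parameters (read off the name OpenVLA-7B, not stated in the abstract), and the teacher control rate it derives is an inference from the reported numbers rather than a figure the authors give.

```python
teacher_params = 7e9      # assumption: nominal size implied by "OpenVLA-7B"
student_params = 158e6    # reported student size

size_reduction = teacher_params / student_params

student_hz = 12.5         # reported student control rate on an RTX 4090
speedup = 3.28            # reported speedup over the teacher

# Derived, not reported: the teacher rate implied by the two numbers above.
implied_teacher_hz = student_hz / speedup

print(f"size reduction ~ {size_reduction:.1f}x")
print(f"implied teacher rate ~ {implied_teacher_hz:.2f} Hz")
```

The size ratio comes out near 44, consistent with the stated reduction, and the implied teacher rate is a little under 4 Hz.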

Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search

2605.16238v1 by Sarah Martinson, Michael P. Brenner, Martyna Plomecka, Brian P. Williams, Nicholas G. Reich, Zahra Shamsi

Probabilistic forecasting of infectious diseases is crucial for public health but relies on labor-intensive manual model curation by expert modeling teams. This bespoke development bottlenecks scalability to granular geographic resolutions or emerging pathogens. Here, we present an autonomous system using Large Language Model (LLM)-guided tree search to iteratively generate, evaluate, and optimize executable forecasting software. In a fully prospective, real-time evaluation during the 2025-2026 US respiratory season, the system autonomously discovered methodologically diverse models for influenza, COVID-19, and respiratory syncytial virus (RSV). Aggregating these machine-generated models yielded an ensemble that consistently matched or outperformed the gold-standard, human-curated Centers for Disease Control and Prevention (CDC) hub ensembles out-of-sample. The system successfully navigated data-scarce "cold start" scenarios for RSV. Moreover, controlled retrospective ablations revealed that optimizing log-scale distance metrics prevents reward hacking, while an automated judge-in-the-loop ensures structural fidelity to complex scientific theories. By autonomously translating epidemiological theory into accurate, transparent code, this framework overcomes the modeling labor bottleneck, enabling rapid deployment of expert-level disease forecasting at unprecedented scales.

FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

2605.16233v1 by Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman

Can LLM agents improve decision-making through self-generated memory without gradient updates? We propose FORGE (Failure-Optimized Reflective Graduation and Evolution), a staged, population-based protocol that evolves prompt-injected natural-language memory for hierarchical ReAct agents. FORGE wraps a Reflexion-style inner loop, where a dedicated reflection agent (using the same underlying LLM, no distillation from a stronger model) converts failed trajectories into reusable knowledge artifacts: textual heuristics (Rules), few-shot demonstrations (Examples), or both (Mixed), with an outer loop that propagates the best-performing instance's memory to the population between stages and freezes converged instances via a graduation criterion. We evaluate on CybORG CAGE-2, a stochastic network-defense POMDP at a 30-step horizon against the B-line attacker, where all four tested LLM families (Gemini-2.5-Flash-Lite, Grok-4-Fast, Llama-4-Maverick, Qwen3-235B) exhibit strongly negative, heavy-tailed zero-shot rewards. Compared against both a zero-shot baseline and a Reflexion baseline (isolated single-stream learning), FORGE improves average evaluation return by 1.7-7.7× over zero-shot and by 29-72% over Reflexion in all 12 model-representation conditions, reducing major-failure rates (below -100) to as low as ~1%. We find that (1) population broadcast is a critical mechanism, with a no-graduation ablation confirming that broadcast carries the performance gains while graduation primarily saves compute; (2) Examples achieves the strongest returns for three of four models, while Rules offers the best cost-reliability profile with ~40% fewer tokens; and (3) weaker baseline models benefit disproportionately, suggesting FORGE may mitigate capability gaps rather than amplify strong models. All evidence is confined to CAGE-2 B-line; cross-family findings are directional evidence.
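
The outer loop described above (evaluate each instance, broadcast the best memory, freeze converged instances) can be sketched in a few lines. This is a toy under stated assumptions, not the FORGE implementation: the `Instance` class, the graduation rule (freeze when the return improves by no more than `tol`), and the stand-in `evaluate` and `reflect` functions are all hypothetical, and no LLM is involved.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    memory: list = field(default_factory=list)  # prompt-injected text memory
    best_return: float = float("-inf")
    graduated: bool = False


def evolve(population, evaluate, reflect, stages=2, tol=1.0):
    """Toy staged loop: inner step per instance, then population broadcast."""
    for _ in range(stages):
        returns = {}
        for idx, inst in enumerate(population):
            if inst.graduated:
                continue  # frozen instances no longer spend compute
            ret = evaluate(inst.memory)
            # Stand-in for the reflection agent: add one artifact to memory.
            inst.memory = inst.memory + [reflect(inst.memory, ret)]
            # Assumed graduation rule: freeze once improvement stalls.
            if ret - inst.best_return <= tol:
                inst.graduated = True
            inst.best_return = max(inst.best_return, ret)
            returns[idx] = ret
        if not returns:
            break
        best = population[max(returns, key=returns.get)]
        # Broadcast: every still-active instance adopts the best memory.
        for inst in population:
            if not inst.graduated:
                inst.memory = list(best.memory)
    return population


# Hypothetical environment: each distinct memory item is worth 20 reward.
def evaluate(memory):
    return -100.0 + 20.0 * len(set(memory))


def reflect(memory, ret):
    return f"rule-{len(memory)}"


population = [Instance(), Instance(memory=["rule-a", "rule-b"])]
evolve(population, evaluate, reflect)
print(population[0].memory, population[0].best_return)
```

In this toy run the initially empty instance inherits the stronger instance's memory after the first stage, which is the effect the abstract attributes to population broadcast.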

Evaluating Design Video Generation: Metrics for Compositional Fidelity

2605.16223v1 by Adrienne Deganutti, Dingning Cao, Jaejung Seol, Elad Hirsch, Purvanshi Mehta

Generative video models are increasingly used in design animation tasks, yet no standardized evaluation framework exists for this domain. Unlike natural video generation, design animation imposes structured constraints: specific components shall animate with prescribed motion types, directions, speed and timing, while non-animated regions must remain stable and layout structure must be preserved. This paper provides a fully automated evaluation framework organized across four dimensions: layout fidelity, motion correctness, temporal quality, and content fidelity. This eliminates the reliance on subjective human evaluation and establishes a common basis for benchmarking progress in the field.

Artificial Aphasias in Lesioned Language Models

2605.16222v1 by Nathan Roll, Jill Kries, Laura Gwilliams, Cory Shain

Aphasias, selective language impairments which can arise from brain damage, reveal the functional organization of human language by providing causal links between affected brain regions and specific symptom profiles. Drawing on this literature, we introduce an aphasia-inspired technique to characterize the emergent functional organization of language models (LMs). We "lesion" (zero-out) model parameters and measure the effects of this intervention against clinical aphasia symptoms, as diagnosed by the Text Aphasia Battery (TAB). When applied to 112,426 outputs from five 1B-scale LMs, the full range of evaluated symptoms surfaces, but in distributions largely distinct from those of humans. Our method uncovers broad symptom-profile differences between attention components (query, key, value, output) and feed-forward components (up, gate, down), with weaker evidence for differences among components within the same mechanism. We also find an effect of depth, where lesions in early layers disproportionately cause syntactic and semantic symptoms while late-middle layers yield higher rates of phonological and fluency deficits. Although some LM lesions induce quantitatively more similar profiles to some human aphasia types than others, qualitative differences in symptom patterns between LMs and humans suggest that aphasia syndromes are heavily influenced by the details of learning and processing rather than being a domain-invariant consequence of disrupted language processing.
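
The lesioning intervention itself (zeroing out one parameter group and comparing behavior before and after) is simple to sketch. The example below is an illustration on a hand-written two-layer feed-forward block, not the authors' code or a real language model: the parameter names `up` and `down` echo the component names in the abstract, while the weights and input are made up.

```python
import copy


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, x) for x in vector]


def forward(params, x):
    """Tiny feed-forward block: up-projection, ReLU, down-projection."""
    hidden = relu(matvec(params["up"], x))
    return matvec(params["down"], hidden)


def lesion(params, name):
    """Return a copy of params with every entry of params[name] set to zero."""
    damaged = copy.deepcopy(params)
    damaged[name] = [[0.0] * len(row) for row in damaged[name]]
    return damaged


# Hypothetical weights and input.
params = {"up": [[1.0, -1.0], [0.5, 2.0]], "down": [[1.0, 1.0]]}
x = [1.0, 2.0]

print(forward(params, x))                # intact output
print(forward(lesion(params, "up"), x))  # output after lesioning "up"
```

Working on a deep copy keeps the intact model available, so each lesion can be applied and measured independently of the others.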

Argus: Evidence Assembly for Scalable Deep Research Agents

2605.16217v1 by Zhen Zhang, Liangcai Su, Zhuo Chen, Xiang Lin, Haotian Xu, Simon Shaolei Du, Kaiyu Yang, Bo An, Lidong Bing, Xinyu Wang

Deep research agents have achieved remarkable progress on complex information seeking tasks. Even long ReAct style rollouts explore only a single trajectory, while recent state of the art systems scale inference time compute via parallel search and aggregation. Yet deep research answers are composed of complementary pieces of evidence, which parallel rollouts often duplicate rather than complete, yielding diminishing returns while pushing the aggregation context toward the model's limit. We propose Argus, an agentic system in which a Searcher and a Navigator cooperate to treat deep research as assembling a jigsaw from complementary evidence pieces, rather than brute forcing the whole answer in parallel. The Searcher collects evidence traces for a given sub-query through ReAct-style interaction. The Navigator maintains a shared evidence graph, verifying which pieces are still missing, dispatching Searchers to gather them, and reasoning over the completed graph to produce a source-traced final answer. We train the Navigator with reinforcement learning to verify, dispatch, and synthesize, while independently training the Searcher to remain a standard ReAct agent. The resulting Navigator supports rollouts with a single Searcher or many in parallel without retraining. With both Searcher and Navigator built on a 35B-A3B MoE backbone, Argus gains 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers, averaged over eight benchmarks. With 64 Searchers it reaches 86.2 on BrowseComp, surpassing every proprietary agent we benchmark, while the Navigator's reasoning context stays under 21.5K tokens.
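
The division of labor between Navigator and Searcher can be sketched as a loop that dispatches searches only for evidence pieces that are still missing. The code below is a toy under stated assumptions, not the Argus system: a flat dictionary stands in for the evidence graph, `toy_search` stands in for a ReAct-style Searcher, the required pieces are given up front rather than inferred, and there is no learned policy.

```python
def navigate(required_pieces, search, max_rounds=3):
    """Collect one finding per required piece, then assemble a sourced answer.

    search(piece) is assumed to return (finding, source) or None.
    """
    evidence = {}       # piece -> (finding, source)
    dispatch_log = []   # every sub-query that was sent to a searcher
    for _ in range(max_rounds):
        # Verify which pieces are still missing before dispatching anything.
        missing = [p for p in required_pieces if p not in evidence]
        if not missing:
            break
        for piece in missing:
            dispatch_log.append(piece)
            result = search(piece)
            if result is not None:
                evidence[piece] = result
    # Synthesize an answer that keeps the source next to each finding.
    answer = "; ".join(
        f"{piece}: {evidence[piece][0]} [{evidence[piece][1]}]"
        for piece in required_pieces
        if piece in evidence
    )
    return answer, dispatch_log


# Hypothetical corpus standing in for web search.
TOY_INDEX = {"founder": ("Ada", "doc-1"), "year": ("1999", "doc-2")}


def toy_search(piece):
    return TOY_INDEX.get(piece)


answer, log = navigate(["founder", "year"], toy_search)
print(answer)
print(log)
```

Because the loop re-checks what is missing each round, no piece that already has evidence is searched again, which is the contrast the abstract draws with parallel rollouts that duplicate work.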

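Summary: Deep research agents have made significant progress on complex information-seeking tasks. Even long ReAct-style rollouts explore only a single trajectory, while recent state-of-the-art systems scale inference-time compute through parallel search and aggregation. Yet deep research answers are composed of complementary pieces of evidence, which parallel rollouts tend to duplicate rather than complete, yielding diminishing returns while pushing the aggregation context toward the model's limit. We propose Argus, an agentic system in which a Searcher and a Navigator cooperate, treating deep research as assembling a jigsaw from complementary evidence pieces rather than brute-forcing the whole answer in parallel. The Searcher collects evidence traces for a given sub-query through ReAct-style interaction. The Navigator maintains a shared evidence graph, verifies which pieces are still missing, dispatches Searchers to collect them, and reasons over the completed graph to produce a source-traced final answer. We train the Navigator with reinforcement learning to verify, dispatch, and synthesize, and separately train the Searcher so that it remains a standard ReAct agent. The resulting Navigator supports rollouts with one Searcher or many in parallel without retraining. With both Searcher and Navigator built on a 35B-A3B MoE backbone, Argus gains 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers, averaged over eight benchmarks. With 64 Searchers it reaches 86.2 on BrowseComp, surpassing every proprietary agent we benchmark, while the Navigator's reasoning context stays under 21.5K tokens.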

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

2605.16215v1 by Xavier Theimer-Lienhard, Mushtaha El-Amin, Fay Elhassan, Sahaj Vaidya, Victor Cartier-Negadi, David Sasu, Lars Klein, Mary-Anne Hartley

Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end-to-end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate using an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their bases. Apertus-70B-MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.

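Summary: Clinical decision support systems (CDSS) need scrutable, auditable pipelines that allow rigorous and reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only: they release parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end to end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and extends coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate with an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their bases. Apertus-70B-MeditronFO improves by 6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, setting a new FO state of the art (SoTA). Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can reach state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.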

Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most

2605.16207v1 by Tahreem Yasir, Wenbo Li, Sam Gilson, Sutapa Dey Tithi, Xiaoyi Tian, Tiffany Barnes

Effective tutoring requires distinguishing optimal, valid but suboptimal, and incorrect student solutions, a distinction central to intelligent tutoring systems (ITS) but untested for LLM-based tutors. As LLMs are increasingly explored as conversational complements to ITS, evaluating their diagnostic precision is essential. We present a benchmark of seven LLM feedback agents in propositional logic using knowledge-graph-derived ground truth across 10,836 solution--feedback pairs and three feedback conditions. Models achieved near-ceiling performance on optimal steps but systematically over-rejected valid but suboptimal reasoning and over-validated incorrect solutions, precisely where adaptive tutoring matters most. These failures persisted across models regardless of solution context, suggesting architectural rather than informational limits. Moreover, accurate diagnosis did not reliably produce pedagogically actionable feedback, revealing a gap between diagnostic judgment and instructional effectiveness. Our findings suggest that LLMs are better suited for hybrid architectures where KG-grounded models handle diagnosis while LLMs support open-ended scaffolding and dialogue.

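Summary: Effective tutoring requires distinguishing optimal, valid but suboptimal, and incorrect student solutions. This distinction is central to intelligent tutoring systems (ITS) but has not been tested for LLM-based tutors. As LLMs are increasingly explored as conversational complements to ITS, evaluating their diagnostic precision becomes essential. We present a benchmark of seven LLM feedback agents in propositional logic, using knowledge-graph-derived ground truth across 10,836 solution-feedback pairs and three feedback conditions. Models reached near-ceiling performance on optimal steps but systematically over-rejected valid but suboptimal reasoning and over-validated incorrect solutions, which is precisely where adaptive tutoring matters most. These failures persisted across models regardless of solution context, suggesting architectural rather than informational limits. Moreover, accurate diagnosis did not reliably produce pedagogically actionable feedback, revealing a gap between diagnostic judgment and instructional effectiveness. Our findings suggest that LLMs are better suited to hybrid architectures in which KG-grounded models handle diagnosis while LLMs support open-ended scaffolding and dialogue.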

Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP

2605.16205v1 by Igor Bogdanov, Chung-Horng Lung, Thomas Kunz, Jie Gao, Adrian Taylor, Marzia Zaman

Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance versus merely increase inference costs. We present a controlled study of compound LLM agent design in CybORG CAGE-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation mode. Our evaluation spans five model families, six models, and twelve configurations (3,475 episodes) with token-level cost accounting. We vary context representation (raw observations vs. a deterministic state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional chain-of-thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub-agents). We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4$\times$ worse mean return while using 1.8-2.7$\times$ more tokens. We call this destructive pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation achieves the best absolute performance for most models, and context engineering is generally more cost-effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, as these strategies can interfere when combined.

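Summary: Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance rather than merely increasing inference cost. We present a controlled study of compound LLM agent design in CybORG CAGE-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation mode. Our evaluation spans five model families, six models, and twelve configurations (3,475 episodes) with token-level cost accounting. We vary context representation (raw observations vs. a deterministic state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional chain-of-thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub-agents). We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, with mean return up to 3.4$\times$ worse while using 1.8-2.7$\times$ more tokens. We call this destructive pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation achieves the best absolute performance for most models, and context engineering is generally more cost-effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, because these strategies can interfere when combined.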

Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems

2605.16198v1 by Parand A. Alamdari, Toryn Q. Klassen, Sheila A. McIlraith

We examine one particular dimension of AI governance: how to monitor and audit AI-enabled products and services throughout the AI development lifecycle, from pre-deployment testing to post-deployment auditing. Combining principles from formal methods with SoTA machine learning, we propose techniques that enable AI-enabled product and service developers, as well as third party AI developers and evaluators, to perform offline auditing and online (runtime) monitoring of product-specific (temporally extended) behavioral constraints such as safety constraints, norms, rules and regulations with respect to black-box advanced AI systems, notably LLMs. We further provide practical techniques for predictive monitoring, such as sampling-based methods, and we introduce intervening monitors that act at runtime to preempt and potentially mitigate predicted violations. Experimental results show that by exploiting the formal syntax and semantics of Linear Temporal Logic (LTL), our proposed auditing and monitoring techniques are superior to LLM baseline methods in detecting violations of temporally extended behavioral constraints; with our approach, even small-model labelers match or exceed frontier LLM judges. Our predictive and intervening monitors significantly reduce the violation rates of LLM-based agents while largely preserving task performance. We further show through controlled experiments that LLMs' temporal reasoning shows a pronounced degradation in accuracy with increasing event distance, number of constraints, and number of propositions.

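Summary: We examine one particular dimension of AI governance: how to monitor and audit AI-enabled products and services throughout the AI development lifecycle, from pre-deployment testing to post-deployment auditing. Combining principles from formal methods with state-of-the-art machine learning, we propose techniques that let developers of AI-enabled products and services, as well as third-party AI developers and evaluators, perform offline auditing and online (runtime) monitoring of product-specific, temporally extended behavioral constraints, such as safety constraints, norms, rules, and regulations, with respect to black-box advanced AI systems, notably LLMs. We further provide practical techniques for predictive monitoring, such as sampling-based methods, and introduce intervening monitors that act at runtime to preempt and potentially mitigate predicted violations. Experimental results show that, by exploiting the formal syntax and semantics of Linear Temporal Logic (LTL), our auditing and monitoring techniques outperform LLM baseline methods at detecting violations of temporally extended behavioral constraints; with our approach, even small-model labelers match or exceed frontier LLM judges. Our predictive and intervening monitors significantly reduce the violation rates of LLM-based agents while largely preserving task performance. Controlled experiments further show that the accuracy of LLMs' temporal reasoning degrades markedly as event distance, number of constraints, and number of propositions increase.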
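
To make the idea of runtime monitoring of a temporally extended constraint concrete, the following is a minimal sketch, not the authors' implementation. It assumes two illustrative LTL-style properties over a finite trace of proposition sets: a response property G(trigger -> F response) and a safety property G(not forbidden). The class name, the proposition names, and the finite-trace convention (an obligation still open at the end of the trace counts as a violation) are all assumptions made for illustration.

```python
from typing import Iterable, Set


class ResponseMonitor:
    """Hypothetical monitor for G(trigger -> F response) and G(not forbidden)."""

    def __init__(self, trigger: str, response: str, forbidden: str) -> None:
        self.trigger = trigger
        self.response = response
        self.forbidden = forbidden
        self.pending = False   # an obligation to respond is still open
        self.violated = False  # the safety property has been broken

    def step(self, props: Set[str]) -> str:
        """Consume one event (the set of propositions true at this step)."""
        if self.forbidden in props:
            self.violated = True      # safety violations are irrevocable
        if self.trigger in props:
            self.pending = True
        if self.response in props:
            self.pending = False      # F is reflexive: same-step response counts
        if self.violated:
            return "violated"
        return "pending" if self.pending else "ok"

    def finish(self) -> str:
        """Final verdict; assumes open obligations at trace end are violations."""
        return "violated" if (self.violated or self.pending) else "satisfied"


def audit(trace: Iterable[Set[str]], monitor: ResponseMonitor) -> str:
    """Offline audit: replay a recorded trace and stop at the first violation."""
    for props in trace:
        if monitor.step(props) == "violated":
            return "violated"
    return monitor.finish()
```

In an online setting, the same `step` method could be called once per agent action, and a "pending" or "violated" verdict could trigger an intervention; how such an intervention is chosen is outside this sketch.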

paper.json: A Coordination Convention for LLM-Agent-Actionable Papers

2605.16194v1 by Arquimedes Canedo

LLM agents routinely serve as first (and sometimes only) readers of academic papers, skimming for sub-claims, extracting reproducibility steps, and generalizing scope. Standard prose papers produce recurring failures in this role: sub-claims that cannot be cited at sub-paper granularity, scope overextension beyond what the paper tests, and figure commands buried in codebases rather than the paper itself. We propose paper.json, a companion JSON file that travels with the PDF and addresses each failure with a lightweight convention: stable claim IDs (C1), an explicit does-not-claim list (C2), exact per-figure shell commands (C3), and stable definition IDs (C5). A fifth convention (C4) holds that minimum viable compliance, hand-written JSON alongside the PDF, is achievable in under an hour for a finished paper without touching the human-readable output. C1, C2, C3, and C5 are open invitations: an agent that reads a compliant paper and acts on it produces evidence for or against them. This paper is itself compliant: uv run validator.py paper.json --against paper.typ passes. Repo: https://github.com/arquicanedo/paper-json

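Summary: LLM agents routinely serve as the first (and sometimes only) readers of academic papers, skimming for sub-claims, extracting reproducibility steps, and generalizing scope. Standard prose papers produce recurring failures in this role: sub-claims that cannot be cited at sub-paper granularity, scope extended beyond what the paper tests, and figure commands buried in codebases rather than in the paper itself. We propose paper.json, a companion JSON file that travels with the PDF and addresses each failure with a lightweight convention: stable claim IDs (C1), an explicit does-not-claim list (C2), exact per-figure shell commands (C3), and stable definition IDs (C5). A fifth convention (C4) holds that minimum viable compliance, a hand-written JSON file alongside the PDF, is achievable in under an hour for a finished paper without touching the human-readable output. C1, C2, C3, and C5 are open invitations: an agent that reads a compliant paper and acts on it produces evidence for or against them. This paper is itself compliant: uv run validator.py paper.json --against paper.typ passes. Repo: https://github.com/arquicanedo/paper-json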

Improving Cross-Cultural Survey Simulation with Calibrated Value Personas

2605.16193v1 by Axel Abels, Elias Fernandez Domingos, Apurva Shah, Tom Lenaerts

Large language models (LLMs) are increasingly used to simulate human opinions and survey responses, but their ability to reproduce population responses across cultures remains limited. Existing persona-based prompting methods typically rely on sociodemographic or personality traits, which are only indirect proxies for the values that shape human responses. We propose a value-based persona construction method that derives textual descriptors from survey responses capturing core cultural dimensions. By sampling value profiles from target populations and aggregating LLM responses across personas, we obtain population-level predictions grounded in observed value distributions. We further introduce a calibration procedure that improves response diversity while preserving estimated opinions. We show that our approach reduces prediction error across countries, with the largest improvements observed in underrepresented populations. This substantially narrows the performance gap between countries aligned with dominant LLM priors and those that are less represented in training data, while also yielding response distributions that closely match human diversity.

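Summary: Large language models (LLMs) are increasingly used to simulate human opinions and survey responses, but their ability to reproduce population responses across cultures remains limited. Existing persona-based prompting methods typically rely on sociodemographic or personality traits, which are only indirect proxies for the values that shape human responses. We propose a value-based persona construction method that derives textual descriptors from survey responses capturing core cultural dimensions. By sampling value profiles from target populations and aggregating LLM responses across personas, we obtain population-level predictions grounded in observed value distributions. We further introduce a calibration procedure that improves response diversity while preserving estimated opinions. We show that our approach reduces prediction error across countries, with the largest improvements in underrepresented populations. This substantially narrows the performance gap between countries aligned with dominant LLM priors and those less represented in training data, while also yielding response distributions that closely match human diversity.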

2605.16191v1 by Michael P. Brenner, Lizzie Dorfman, John C. Platt

We present a case study for how AI coding systems can be used to generate novel scientific hypotheses. We combine a generic coding agent (Google's AntiGravity) with an LLM-driven tree search algorithm (Empirical Research Assistance / ERA) to autonomously generate high-efficiency three-dimensional photovoltaic (3DPV) structures that overcome losses limiting flat solar panels at mid-latitudes. These structures operate by presenting favorable angles to the sun throughout the day, and for illustrative purposes we focus on optimizing performance for a single solar day. Our workflow begins by using AntiGravity to reproduce calculations \cite{bernardi2012solar} showing that 3DPV can have energy densities much higher than stationary flat PV panels. We use these initial designs as the starting point for large scale tree search, where we seek improved solutions and score them for their diurnal yield. The initial tree search leads to nominally more efficient solutions, yet they are caused by algorithmic reward hacking, arising from non-physical design features such as structurally levitating disconnected tiers and exploitations of the discretizations in the optics solver. To counteract this, we develop a workflow where the coding agent iteratively patches the physics engine with constraints to eliminate reward hacking. With reward-hacking eliminated, ERA discovers a series of designs with various constraints and improved performance, including optimal designs with different fixed collector areas, optimizing zenith tracking and avoiding self shadowing. Combining coding agents with tree search (ERA) provides a powerful platform for scientific discovery, for problems whose solutions can be empirically evaluated with a score function.

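Summary: We present a case study of how AI coding systems can be used to generate novel scientific hypotheses. We combine a generic coding agent (Google's AntiGravity) with an LLM-driven tree search algorithm (Empirical Research Assistance / ERA) to autonomously generate high-efficiency three-dimensional photovoltaic (3DPV) structures that overcome the losses limiting flat solar panels at mid-latitudes. These structures work by presenting favorable angles to the sun throughout the day; for illustrative purposes we focus on optimizing performance for a single solar day. Our workflow begins by using AntiGravity to reproduce calculations \cite{bernardi2012solar} showing that 3DPV can have energy densities much higher than stationary flat PV panels. We use these initial designs as the starting point for a large-scale tree search, in which we seek improved solutions and score them by their diurnal yield. The initial tree search yields nominally more efficient solutions, but these result from algorithmic reward hacking, arising from non-physical design features such as structurally levitating disconnected tiers and exploitation of the discretization in the optics solver. To counteract this, we develop a workflow in which the coding agent iteratively patches the physics engine with constraints that eliminate reward hacking. With reward hacking eliminated, ERA discovers a series of designs with various constraints and improved performance, including optimal designs with different fixed collector areas that optimize zenith tracking and avoid self-shadowing. Combining coding agents with tree search (ERA) provides a powerful platform for scientific discovery on problems whose solutions can be empirically evaluated with a score function.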

Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models

2605.16165v1 by Yishun Lu, Wes Armour

Autoregressive next-token training offers a unified formulation for image generation and text understanding, but it also creates strong modality competition that destabilizes optimization and limits large-batch scaling. We show that first-order optimizers such as AdamW are vulnerable to cross-modality gradient heterogeneity, while second-order preconditioning, particularly SOAP, provides a more stable basis for multimodal alignment. Building on this insight, we propose ML-FOP-SOAP, a second-order optimization framework with Multi-Level Variance Correction. Our Fisher-Orthogonal Projection suppresses variance-induced modality conflicts, reducing the trade-off between visual generation and textual understanding. To make this practical under large gradient accumulation, we introduce a hierarchical folding strategy that captures fine-grained variance with low micro-step overhead. Experiments on Janus and Emu3 show consistent gains across both modalities and stable training at batch size 8192. Compared with AdamW, our method improves sample efficiency by up to $1.4\times$ and accelerates wall-clock training by up to $1.5\times$, offering a robust optimizer for scaling multimodal foundation models.

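Summary: Autoregressive next-token training offers a unified formulation for image generation and text understanding, but it also creates strong modality competition that destabilizes optimization and limits large-batch scaling. We show that first-order optimizers such as AdamW are vulnerable to cross-modality gradient heterogeneity, whereas second-order preconditioning, particularly SOAP, provides a more stable basis for multimodal alignment. Building on this insight, we propose ML-FOP-SOAP, a second-order optimization framework with Multi-Level Variance Correction. Our Fisher-Orthogonal Projection suppresses variance-induced modality conflicts, reducing the trade-off between visual generation and textual understanding. To make this practical under large gradient accumulation, we introduce a hierarchical folding strategy that captures fine-grained variance with low micro-step overhead. Experiments on Janus and Emu3 show consistent gains on both modalities and stable training at batch size 8192. Compared with AdamW, our method improves sample efficiency by up to $1.4\times$ and speeds up wall-clock training by up to $1.5\times$, offering a robust optimizer for scaling multimodal foundation models.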

An Algebraic Exposition of the Theory of Dyadic Morality

2605.16153v1 by Kush R. Varshney

This paper provides an algebraic exposition of the theory of dyadic morality (TDM), a psychological model of moral judgment grounded in a simple two-node template: an intentional agent causing harm to a vulnerable patient. We formalize TDM using structural causal modeling (SCM) notation and identify three psychological operators (typecasting operator, completion operator, and valence-dependent inference mechanism) that extend standard SCM to capture how people compute moral judgments under constraints. We address scalability challenges arising from TDM's dyadic limitation, showing how moral cognition compresses multi-node scenarios through node collapse and sequential processing. Drawing on this algebraic framework, we demonstrate concrete applications to AI policy design: detecting conflicting obligations, structuring helpfulness policies to preserve user agency, and designing post-failure communication as causal interventions. Finally, we recommend scoped, contextual measurement of mind perception over universal averaging to operationalize the theory empirically. This algebraic formalization enables neurosymbolic AI systems to compute morality in a way that is both mathematically rigorous and faithful to human moral cognition.

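Summary: This paper provides an algebraic exposition of the theory of dyadic morality (TDM), a psychological model of moral judgment grounded in a simple two-node template: an intentional agent causing harm to a vulnerable patient. We formalize TDM using structural causal modeling (SCM) notation and identify three psychological operators (a typecasting operator, a completion operator, and a valence-dependent inference mechanism) that extend standard SCM to capture how people compute moral judgments under constraints. We address the scalability challenges arising from TDM's dyadic limitation, showing how moral cognition compresses multi-node scenarios through node collapse and sequential processing. Drawing on this algebraic framework, we demonstrate concrete applications to AI policy design: detecting conflicting obligations, structuring helpfulness policies to preserve user agency, and designing post-failure communication as causal interventions. Finally, we recommend scoped, contextual measurement of mind perception, rather than universal averaging, to operationalize the theory empirically. This algebraic formalization enables neurosymbolic AI systems to compute morality in a way that is both mathematically rigorous and faithful to human moral cognition.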

Look Before You Leap: Autonomous Exploration for LLM Agents

2605.16143v1 by Ziang Ye, Wentao Shi, Yuxin Liu, Yu Wang, Zhengzhou Cai, Yaorui Shi, Qi Gu, Xunliang Cai, Fuli Feng

Large language model based agents often fail in unfamiliar environments due to premature exploitation: a tendency to act on prior knowledge before acquiring sufficient environment-specific information. We identify autonomous exploration as a critical yet underexplored capability for building adaptive agents. To formalize and quantify this capability, we introduce Exploration Checkpoint Coverage, a verifiable metric that measures how broadly an agent discovers key states, objects, and affordances. Our systematic evaluation reveals that agents trained with standard task-oriented reinforcement learning consistently exhibit narrow and repetitive behaviors that impede downstream performance. To address this limitation, we develop a training strategy that interleaves task-execution rollouts and exploration rollouts, with each type of rollout optimized by its corresponding verifiable reward. Building on this training strategy, we propose the Explore-then-Act paradigm, which decouples information-gathering from task execution: agents first utilize an interaction budget to acquire grounded environmental knowledge, then leverage it for task resolution. Our results demonstrate that learning to systematically explore is imperative for building generalizable and real-world-ready agents.

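Summary: Agents based on large language models often fail in unfamiliar environments because of premature exploitation: a tendency to act on prior knowledge before acquiring sufficient environment-specific information. We identify autonomous exploration as a critical yet underexplored capability for building adaptive agents. To formalize and quantify this capability, we introduce Exploration Checkpoint Coverage, a verifiable metric that measures how broadly an agent discovers key states, objects, and affordances. Our systematic evaluation shows that agents trained with standard task-oriented reinforcement learning consistently exhibit narrow and repetitive behaviors that impede downstream performance. To address this limitation, we develop a training strategy that interleaves task-execution rollouts with exploration rollouts, each type optimized by its corresponding verifiable reward. Building on this training strategy, we propose the Explore-then-Act paradigm, which decouples information gathering from task execution: agents first use an interaction budget to acquire grounded environmental knowledge, then leverage it to solve the task. Our results show that learning to explore systematically is essential for building generalizable, real-world-ready agents.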
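
As a rough illustration of a coverage-style exploration metric, the sketch below scores a rollout by the fraction of predefined checkpoints it reached. The abstract does not give the exact definition of Exploration Checkpoint Coverage, so the formula, the function name, and the example checkpoint names here are assumptions rather than the paper's metric.

```python
from typing import Hashable, Iterable


def checkpoint_coverage(visited: Iterable[Hashable],
                        checkpoints: Iterable[Hashable]) -> float:
    """Fraction of checkpoints that appear at least once in the rollout.

    Assumed definition: |checkpoints ∩ visited| / |checkpoints|.
    Repeated visits add nothing, so narrow, repetitive behavior scores low.
    """
    required = set(checkpoints)
    if not required:
        raise ValueError("at least one checkpoint is required")
    return len(required & set(visited)) / len(required)


# Hypothetical rollout: the agent keeps returning to the same two places.
rollout = ["kitchen", "drawer", "kitchen", "drawer", "kitchen"]
targets = {"kitchen", "drawer", "fridge", "key"}
coverage = checkpoint_coverage(rollout, targets)  # 2 of 4 checkpoints found
```

Because the score depends only on which checkpoints were reached, it can be verified from a logged trajectory without judging the agent's task success.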

Property-Guided LLM Program Synthesis for Planning

2605.16142v1 by Augusto B. Corrêa, André G. Pereira, Jendrik Seipp

LLMs have shown impressive success in program synthesis, discovering programs that surpass prior solutions. However, these approaches rely on simple numeric scores to signal program quality, such as the value of the solution or the number of passed tests. Because a score offers no guidance on why a program failed, the system must generate and evaluate many candidates hoping some succeed, increasing LLM inference and evaluation costs. We study a different approach: property-guided LLM program synthesis. Instead of scoring programs after evaluation, we check whether a candidate satisfies a formally defined property. When the property is violated, we stop the evaluation early and provide the LLM with a concrete counterexample showing exactly how the program failed. This feedback drastically reduces both the number of program generations and the evaluation cost, and can guide the LLM to generate stronger programs. We evaluate this approach on PDDL planning domains, asking the LLM to synthesize direct heuristic functions: every state reachable by strictly improving transitions has a strictly improving successor. A heuristic with this property leads a hill-climbing algorithm directly to a goal state. A counterexample-guided repair loop generates one candidate program, checks the property over a training set, and returns the first case that violates the property. We evaluate our approach on ten planning domains with an out-of-distribution test set. The synthesized heuristics are effectively direct on virtually all test tasks, and compared to the best prior generation method our approach generates seven times fewer programs per domain on average, solves more tasks without using search, and requires several orders of magnitude less computation to evaluate candidates. Whenever a problem admits a verifiable property, property-guided LLM synthesis can reduce cost and improve program quality.

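Summary: LLMs have shown impressive success in program synthesis, discovering programs that surpass prior solutions. However, these approaches rely on simple numeric scores to signal program quality, such as the value of the solution or the number of passed tests. Because a score gives no guidance on why a program failed, the system must generate and evaluate many candidates in the hope that some succeed, which increases LLM inference and evaluation costs. We study a different approach: property-guided LLM program synthesis. Instead of scoring programs after evaluation, we check whether a candidate satisfies a formally defined property. When the property is violated, we stop the evaluation early and give the LLM a concrete counterexample showing exactly how the program failed. This feedback drastically reduces both the number of program generations and the evaluation cost, and can guide the LLM toward stronger programs. We evaluate the approach on PDDL planning domains, asking the LLM to synthesize direct heuristic functions: every state reachable by strictly improving transitions has a strictly improving successor. A heuristic with this property leads a hill-climbing algorithm directly to a goal state. A counterexample-guided repair loop generates one candidate program, checks the property over a training set, and returns the first case that violates it. We evaluate our approach on ten planning domains with an out-of-distribution test set. The synthesized heuristics are effectively direct on virtually all test tasks, and compared with the best prior generation method our approach generates seven times fewer programs per domain on average, solves more tasks without search, and needs several orders of magnitude less computation to evaluate candidates. Whenever a problem admits a verifiable property, property-guided LLM synthesis can reduce cost and improve program quality.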
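
The property check with early stopping described above can be sketched as follows. This is an illustrative reading of the "direct heuristic" property on a generic state space, not the authors' PDDL tooling: the function name, the callback signatures, and the choice to exempt goal states from the requirement are assumptions. The check follows only strictly improving transitions from the initial state and returns the first state that has no strictly improving successor.

```python
from typing import Callable, Hashable, Iterable, Optional

State = Hashable


def find_counterexample(initial: State,
                        successors: Callable[[State], Iterable[State]],
                        h: Callable[[State], float],
                        is_goal: Callable[[State], bool]) -> Optional[State]:
    """Return a state violating the assumed directness property, or None.

    Property checked: every non-goal state reachable from `initial` via
    strictly improving transitions (h decreases) has at least one strictly
    improving successor.
    """
    stack = [initial]
    seen = {initial}
    while stack:
        state = stack.pop()
        if is_goal(state):
            continue  # assumption: goals need no improving successor
        improving = [t for t in successors(state) if h(t) < h(state)]
        if not improving:
            return state  # stop early with a concrete counterexample
        for t in improving:
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return None


# Toy domain (hypothetical): positions 0..5 on a line, goal at 0.
def line_successors(s: int) -> list:
    return [t for t in (s - 1, s + 1) if 0 <= t <= 5]

good_h = lambda s: s                                   # true distance to goal
bad_h = {0: 0, 1: 3, 2: 1, 3: 2, 4: 4, 5: 5}.__getitem__  # local minimum at 2

ok = find_counterexample(4, line_successors, good_h, lambda s: s == 0)
stuck = find_counterexample(4, line_successors, bad_h, lambda s: s == 0)
```

In a repair loop, a returned state such as `stuck` would be handed back to the LLM as the counterexample, instead of a bare numeric score.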

Surrogate Neural Architecture Codesign Package (SNAC-Pack)

2605.16138v1 by Jason Weitz, Dmitri Demler, Benjamin Hawks, Aaron Wang, Nhan Tran, Javier Duarte

Neural architecture search (NAS) is a powerful approach for automating model design, but existing methods often optimize for accuracy alone or rely on proxy metrics such as bit operations (BOPs) that correlate poorly with hardware cost. This gap is particularly large for FPGA deployment, where cost is dominated by a multi-dimensional budget of lookup tables, DSPs, flip-flops, BRAM, and latency. We present the Surrogate Neural Architecture Codesign Package (SNAC-Pack), an open-source AutoML framework for hardware-aware neural architecture codesign and end-to-end FPGA deployment. SNAC-Pack runs a multi-objective global search with Optuna and NSGA-II, loading trials to a shared SQLite store that enables parallel workers across compute nodes. A hardware surrogate model outputs per-trial resource and latency estimates, avoiding the synthesis cost that would otherwise dominate the search loop. A local search stage then applies quantization-aware training (QAT) together with iterative magnitude pruning in a combined compression loop, after which the final model is synthesized to FPGA firmware via the hls4ml Python library. A YAML configuration and an optional agentic frontend let users run the pipeline on new datasets without modifying the framework. We demonstrate SNAC-Pack on jet classification at the Large Hadron Collider and superconducting qubit readout, discovering compact architectures that match or exceed strong baselines on the task metric while reducing FPGA resource utilization and, in the qubit readout case, reducing the design space exploration process from months of manual fine-tuning to hours of automated search.

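Summary: Neural architecture search (NAS) is a powerful approach to automating model design, but existing methods often optimize for accuracy alone or rely on proxy metrics such as bit operations (BOPs) that correlate poorly with hardware cost. The gap is especially large for FPGA deployment, where cost is dominated by a multi-dimensional budget of lookup tables, DSPs, flip-flops, BRAM, and latency. We present the Surrogate Neural Architecture Codesign Package (SNAC-Pack), an open-source AutoML framework for hardware-aware neural architecture codesign and end-to-end FPGA deployment. SNAC-Pack runs a multi-objective global search with Optuna and NSGA-II, loading trials into a shared SQLite store that enables parallel workers across compute nodes. A hardware surrogate model outputs per-trial resource and latency estimates, avoiding the synthesis cost that would otherwise dominate the search loop. A local search stage then applies quantization-aware training (QAT) together with iterative magnitude pruning in a combined compression loop, after which the final model is synthesized to FPGA firmware via the hls4ml Python library. A YAML configuration and an optional agentic frontend let users run the pipeline on new datasets without modifying the framework. We demonstrate SNAC-Pack on jet classification at the Large Hadron Collider and on superconducting qubit readout, discovering compact architectures that match or exceed strong baselines on the task metric while reducing FPGA resource utilization and, in the qubit readout case, shortening design space exploration from months of manual fine-tuning to hours of automated search.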

2605.16134v1 by Simon Dufort-Labbé, Mehrab Hamidi, Razvan Pascanu, Ioannis Mitliagkas, Damien Scieur, Aristide Baratin

Sharpness-aware minimization (SAM) encourages flat minima by perturbing parameters along directions of high loss curvature, but treats all parameter directions uniformly, ignoring the underlying loss geometry. We introduce LLQR+SAM, which combines SAM with a learned preconditioner obtained from the recently proposed LLQR framework, a second-order method that recasts steepest descent as a layerwise linear-quadratic regulator problem. The preconditioner is updated sparsely and maintained as a slow exponential moving average, so it captures a smoothed, low-resolution picture of the loss landscape geometry. The SAM perturbation then operates on top of this learned geometry, probing curvature at a faster timescale. We show that this two-timescale structure is not merely a computational convenience: theoretically, the preconditioner amplifies the SAM escape signal in directions that are flat under the average geometry but locally sharp (potholes). Wide, flat basins, by contrast, remain stable. Empirically, LLQR+SAM gives consistent gains over both SAM and LLQR alone across standard vision and sequence modeling benchmarks, supporting the view that slow learned geometry and fast sharpness correction are genuinely complementary.

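Summary: Sharpness-aware minimization (SAM) encourages flat minima by perturbing parameters along directions of high loss curvature, but it treats all parameter directions uniformly, ignoring the underlying loss geometry. We introduce LLQR+SAM, which combines SAM with a learned preconditioner obtained from the recently proposed LLQR framework, a second-order method that recasts steepest descent as a layerwise linear-quadratic regulator problem. The preconditioner is updated sparsely and maintained as a slow exponential moving average, so it captures a smoothed, low-resolution picture of the loss landscape geometry. The SAM perturbation then operates on top of this learned geometry, probing curvature at a faster timescale. We show that this two-timescale structure is more than a computational convenience: theoretically, the preconditioner amplifies the SAM escape signal in directions that are flat under the average geometry but locally sharp (potholes). Wide, flat basins, by contrast, remain stable. Empirically, LLQR+SAM gives consistent gains over both SAM and LLQR alone on standard vision and sequence modeling benchmarks, supporting the view that slow learned geometry and fast sharpness correction are genuinely complementary.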
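
For readers unfamiliar with the SAM half of this combination, here is a minimal sketch of one standard SAM update on plain Python lists: take the gradient, step a distance `rho` along the normalized gradient to a nearby "adversarial" point, and descend using the gradient measured there. The LLQR preconditioner is not reproduced, since the abstract does not specify it; the function name, the hyperparameter values, and the toy quadratic loss are assumptions chosen for illustration.

```python
import math
from typing import Callable, List

Vector = List[float]


def sam_step(w: Vector,
             grad_fn: Callable[[Vector], Vector],
             lr: float = 0.05,
             rho: float = 0.05) -> Vector:
    """One plain SAM update (no preconditioner)."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # stationary point: nothing to perturb or descend
    # Ascent step: move to the approximate worst-case point within radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient taken at the perturbed point to w.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


# Toy loss (hypothetical): an ill-conditioned quadratic bowl.
def loss(w: Vector) -> float:
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)

def grad(w: Vector) -> Vector:
    return [10.0 * w[0], w[1]]

w = [1.0, 1.0]
for _ in range(50):
    w = sam_step(w, grad)
```

Note that the perturbation here treats every coordinate identically, which is the uniformity the abstract criticizes; a preconditioned variant would rescale the directions before or after the perturbation.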

Entropy Across the Bridge: Conditional-Marginal Discretization for Flow and Schrödinger Samplers

2605.16126v1 by Bruno Trentini, Dejan Stancevic, Michael M. Bronstein, Alexander Tong, Luca Ambrogioni

For a fixed flow-based generative model under a small inference budget, sample quality can depend strongly on where the sampler spends its few function evaluations. Flow matching and Schrödinger bridges define probability paths, yet their inference grids are usually heuristic or inherited from one-endpoint diffusion. We derive a conditional-marginal entropy-rate objective for bridge-aware discretization, separating endpoint-conditioned bridge geometry from marginal flow evolution, and use it to build a training-free entropic inference-time scheduler from first principles. For Gaussian Brownian bridges this rate is closed-form and U-shaped, motivating boundary-heavy nonuniform grids. On trained two-dimensional bridge/flow models, the estimated profile recovers the predicted shape and improves 10-step ODE-Heun MMD over linear by 18.1%, with a paired 22.7% SDE-Heun improvement in the same low-NFE sweep. On EDM/CIFAR-10, the entropic time-discretization gives the best tested five-step FID ($186.3 \pm 4.0$ versus $200.5 \pm 2.9$ for linear and $238.0 \pm 5.3$ for cosine). On AlphaFlow protein generation, entropic conditional-marginal (cond-marg) scheduling shows advantage in low-NFE regimes on both CAMEO22 and ATLAS benchmarks. These results support entropy-rate scheduling as a practical low-budget allocation signal for high-dimensional bridge and flow samplers.

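Summary: For a fixed flow-based generative model under a small inference budget, sample quality can depend strongly on where the sampler spends its few function evaluations. Flow matching and Schrödinger bridges define probability paths, yet their inference grids are usually heuristic or inherited from one-endpoint diffusion. We derive a conditional-marginal entropy-rate objective for bridge-aware discretization, separating endpoint-conditioned bridge geometry from marginal flow evolution, and use it to build a training-free entropic inference-time scheduler from first principles. For Gaussian Brownian bridges this rate is closed-form and U-shaped, which motivates boundary-heavy nonuniform grids. On trained two-dimensional bridge/flow models, the estimated profile recovers the predicted shape and improves 10-step ODE-Heun MMD over linear by 18.1%, with a paired SDE-Heun improvement of 22.7% in the same low-NFE sweep. On EDM/CIFAR-10, the entropic time discretization gives the best tested five-step FID ($186.3 \pm 4.0$ versus $200.5 \pm 2.9$ for linear and $238.0 \pm 5.3$ for cosine). On AlphaFlow protein generation, entropic conditional-marginal (cond-marg) scheduling shows an advantage in low-NFE regimes on both the CAMEO22 and ATLAS benchmarks. These results support entropy-rate scheduling as a practical low-budget allocation signal for high-dimensional bridge and flow samplers.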
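
One simple way to turn a rate profile into a nonuniform time grid is to place the grid points so that each step carries an equal share of the integrated rate. The sketch below does this numerically for any positive rate function on [0, 1]. It is an assumption-laden illustration, not the paper's scheduler: the equal-mass rule, the function name, and the U-shaped example rate 1/(t(1-t)+0.01) are stand-ins, and the paper's closed-form Brownian-bridge rate is not reproduced here.

```python
from typing import Callable, List


def equal_mass_grid(rate: Callable[[float], float],
                    n_steps: int,
                    resolution: int = 2000) -> List[float]:
    """Grid 0 = t_0 < ... < t_n = 1 with equal integrated rate per step.

    Assumes rate(t) > 0 on [0, 1]. The integral is approximated with the
    trapezoid rule on a fine uniform mesh and inverted by linear interpolation.
    """
    ts = [i / resolution for i in range(resolution + 1)]
    rs = [rate(t) for t in ts]
    cum = [0.0]  # cumulative integral of the rate up to each mesh point
    for i in range(resolution):
        cum.append(cum[-1] + 0.5 * (rs[i] + rs[i + 1]) * (ts[i + 1] - ts[i]))
    total = cum[-1]

    grid = [0.0]
    j = 0
    for k in range(1, n_steps):
        target = total * k / n_steps
        while cum[j + 1] < target:   # find the mesh cell containing the target
            j += 1
        frac = (target - cum[j]) / (cum[j + 1] - cum[j])
        grid.append(ts[j] + frac * (ts[j + 1] - ts[j]))
    grid.append(1.0)
    return grid


# Illustrative U-shaped rate (an assumption): large near both endpoints.
u_rate = lambda t: 1.0 / (t * (1.0 - t) + 0.01)
grid = equal_mass_grid(u_rate, n_steps=10)
```

With a U-shaped rate the resulting steps are short near t = 0 and t = 1 and long in the middle, which matches the boundary-heavy grids the abstract describes; a constant rate recovers the uniform grid.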

SGR: A Stepwise Reasoning Framework for LLMs with External Subgraph Generation

2605.16117v1 by Xin Zhang, Yang Cao, Baoxing Wu, Kai Song, Siying Li

Large Language Models (LLMs) have demonstrated strong capabilities across diverse NLP applications, such as translation, text generation, and question answering. Nevertheless, they remain limited in complex settings that demand deep reasoning and logical inference. Since these models are trained on large-scale text corpora, their generation process may still introduce irrelevant, noisy, or factually inconsistent content. To mitigate this problem, we introduce SGR, a stepwise framework that enhances LLM reasoning through external subgraph generation. SGR builds query-specific subgraphs from external knowledge bases and uses their semantic structure to support multi-step inference. By grounding intermediate reasoning steps in structured external knowledge, the framework helps the model concentrate on relevant entities, relations, and supporting evidence. In particular, SGR first constructs a subgraph tailored to the input question. It then guides the model to reason progressively over the generated structure and combines multiple reasoning trajectories to obtain the final prediction. Experimental results across several benchmark datasets show that SGR achieves consistent improvements over competitive baselines, highlighting its value for improving both reasoning accuracy and factual reliability.

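Summary: Large Language Models (LLMs) have demonstrated strong capabilities across diverse NLP applications such as translation, text generation, and question answering. Nevertheless, they remain limited in complex settings that demand deep reasoning and logical inference. Because these models are trained on large-scale text corpora, their generation process may still introduce irrelevant, noisy, or factually inconsistent content. To mitigate this problem, we introduce SGR, a stepwise framework that enhances LLM reasoning through external subgraph generation. SGR builds query-specific subgraphs from external knowledge bases and uses their semantic structure to support multi-step inference. By grounding intermediate reasoning steps in structured external knowledge, the framework helps the model focus on relevant entities, relations, and supporting evidence. Specifically, SGR first constructs a subgraph tailored to the input question. It then guides the model to reason step by step over the generated structure and combines multiple reasoning trajectories to obtain the final prediction. Experimental results on several benchmark datasets show that SGR achieves consistent improvements over competitive baselines, highlighting its value for improving both reasoning accuracy and factual reliability.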

DebiasRAG: A Tuning-Free Path to Fair Generation in Large Language Models through Retrieval-Augmented Generation

2605.16113v1 by Rui Chu, Bingyin Zhao, Thanh Quoc Hung Le, Duy Cao Hoang, Huawei Lin, Ping Li, Weijie Zhao, Khoa D Doan, Yingjie Lao

Large language models (LLMs) have achieved unprecedented success due to their exceptional generative capabilities. However, because they depend on knowledge encapsulated from training corpora, they may produce hallucinations, stereotypes, and socially biased content. In particular, LLMs are prone to prejudiced responses involving race, gender, and age, which are collectively referred to as social biases. Prior studies have used fine-tuning and prompt engineering to mitigate such biases in LLMs, but these methods require additional training resources or domain knowledge to design the framework. Moreover, they may degrade the original capabilities of LLMs and often overlook the need for dynamic debiasing contexts for fairer inference. In this paper, we propose DebiasRAG, a novel tuning-free and dynamic query-specific debiasing framework based on retrieval-augmented generation (RAG). DebiasRAG improves fairness while preserving the intrinsic properties of LLMs, such as representation ability. DebiasRAG consists of three stages: (1) query-specific debiasing candidate generation; (2) context candidate pool construction; and (3) gradient-updated debiasing-guided context piece reranking. First, DebiasRAG leverages self-diagnosed bias contexts relevant to the query through regular retrieval, where the bias contexts are prepared offline by the DebiasRAG provider. Given the query-specific bias contexts, DebiasRAG reversely produces debiasing contexts, which are provided as additional fairness constraints for LLM outputs. Second, a regular RAG retrieval process produces query-related contexts from the regular RAG document database, such as a chunked Wikipedia dataset.

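Summary: Large language models (LLMs) have achieved unprecedented success thanks to their exceptional generative capabilities. However, because they depend on knowledge encapsulated from training corpora, they may produce hallucinations, stereotypes, and socially biased content. In particular, LLMs are prone to prejudiced responses involving race, gender, and age, which are collectively referred to as social biases. Prior studies have used fine-tuning and prompt engineering to mitigate these biases, but such methods require additional training resources or domain knowledge to design the framework. They may also degrade the original capabilities of LLMs and often overlook the need for dynamic debiasing contexts for fairer inference. In this paper, we propose DebiasRAG, a novel tuning-free, dynamic, query-specific debiasing framework based on retrieval-augmented generation (RAG). DebiasRAG improves fairness while preserving the intrinsic properties of LLMs, such as representation ability. DebiasRAG consists of three stages: (1) query-specific debiasing candidate generation; (2) context candidate pool construction; and (3) gradient-updated debiasing-guided context piece reranking. First, DebiasRAG uses regular retrieval to obtain self-diagnosed bias contexts relevant to the query, where the bias contexts are prepared offline by the DebiasRAG provider. Given the query-specific bias contexts, DebiasRAG reversely produces debiasing contexts, which are supplied as additional fairness constraints on the LLM output. Second, a regular RAG retrieval process produces query-related contexts from the regular RAG document database, such as a chunked Wikipedia dataset.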

Multi-Level Contextual Token Relation Modeling for Machine-Generated Text Detection

2605.16107v1 by Chenwang Wu, Yiuming Cheung, Bo Han, Shuhai Zhang, Defu Lian

Machine-generated texts (MGTs) pose risks such as disinformation and phishing, underscoring the need for reliable detection. Metric-based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model-based methods that are prone to overfitting. Given their diverse designs, we first place representative metric-based methods within a unified framework, enabling a clear assessment of their advantages and limitations. Our analysis identifies a core challenge across these methods: the token-level detection score is easily biased by the inherent randomness of the MGTs generation process. Then, we theoretically derive the multi-hop transitions of the token-level detection score and explore their local and global relations. Based on these findings, we propose a multi-level contextual token relation modeling framework for MGT detection. Specifically, for local relations, we model them through a lightweight Markov-informed calibration module that refines token-level evidence before aggregation. For global relations, we introduce a rule-support reasoning module that uses explicit logical rules derived from contextual score statistics. Finally, we combine the local calibrated score and the global rule-support reasoning signal in a joint multi-level inference framework. Extensive experiments show broad and substantial improvements across various real-world scenarios, including cross-LLM and cross-domain settings, with low computational overhead.

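Summary: Machine-generated texts (MGTs) pose risks such as disinformation and phishing, underscoring the need for reliable detection. Metric-based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model-based methods, which are prone to overfitting. Given their diverse designs, we first place representative metric-based methods within a unified framework, allowing a clear assessment of their advantages and limitations. Our analysis identifies a core challenge shared by these methods: the token-level detection score is easily biased by the inherent randomness of the MGT generation process. We then theoretically derive the multi-hop transitions of the token-level detection score and explore their local and global relations. Based on these findings, we propose a multi-level contextual token relation modeling framework for MGT detection. For local relations, we use a lightweight Markov-informed calibration module that refines token-level evidence before aggregation. For global relations, we introduce a rule-support reasoning module that uses explicit logical rules derived from contextual score statistics. Finally, we combine the local calibrated score and the global rule-support reasoning signal in a joint multi-level inference framework. Extensive experiments show broad and substantial improvements across various real-world scenarios, including cross-LLM and cross-domain settings, with low computational overhead.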

GeoGS-CE: Learning Delay--Beam Channel Priors with 3D Gaussians for High-Mobility Scenarios

2605.16094v1 by Yumeng Zhang, Jiajia Guo, Chaozheng Wen, Chenghong Bian, Jun Zhang

Wideband channel estimation (CE) in high-mobility scenarios remains challenging because channel responses vary rapidly, while practical systems can allocate only sparse pilots to accommodate dense users. Fortunately, many high-mobility environments, such as high-speed railways, exhibit scheduled trajectories, predictable velocities, and a limited number of dominant propagation paths. These properties induce a delay--beam power spectrum that is more stable than the instantaneous complex channel frequency response (CFR), less sensitive to the random phase coherence, and rich in geometric information. To exploit such environmental properties, we propose GeoGS-CE, a two-stage channel estimation framework for sparse-pilot high-mobility scenarios. In the offline stage, GeoGS-CE jointly models: 1) a scene-level 3D Gaussian representation that captures the non-line-of-sight (NLoS) geometric scattering support, and 2) a leakage-aware differentiable wireless rendering process that maps the NLoS Gaussians, together with an explicit virtual line-of-sight (LoS) component, to the measured delay--beam power spectrum, while accounting for practical OFDM delay and array leakage effects. In the online stage, the delay--beam power spectrum is predicted for each user location and used as a strong covariance prior, enabling accurate full-band and full-array CFR reconstruction and tracking through a linear MMSE estimator. Simulations based on channels generated from a segment of the Guangshen high-speed railway show that the proposed geometric prior substantially improves CFR reconstruction over pilot-only and non-geometric baselines.

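Summary: Wideband channel estimation (CE) in high-mobility scenarios remains challenging because channel responses vary rapidly, while practical systems can allocate only sparse pilots in order to accommodate dense users. Fortunately, many high-mobility environments, such as high-speed railways, exhibit scheduled trajectories, predictable velocities, and a limited number of dominant propagation paths. These properties induce a delay-beam power spectrum that is more stable than the instantaneous complex channel frequency response (CFR), less sensitive to random phase coherence, and rich in geometric information. To exploit these environmental properties, we propose GeoGS-CE, a two-stage channel estimation framework for sparse-pilot high-mobility scenarios. In the offline stage, GeoGS-CE jointly models: 1) a scene-level 3D Gaussian representation that captures the non-line-of-sight (NLoS) geometric scattering support, and 2) a leakage-aware differentiable wireless rendering process that maps the NLoS Gaussians, together with an explicit virtual line-of-sight (LoS) component, to the measured delay-beam power spectrum while accounting for practical OFDM delay and array leakage effects. In the online stage, the delay-beam power spectrum is predicted for each user location and used as a strong covariance prior, enabling accurate full-band, full-array CFR reconstruction and tracking through a linear MMSE estimator. Simulations based on channels generated from a segment of the Guangshen high-speed railway show that the proposed geometric prior substantially improves CFR reconstruction over pilot-only and non-geometric baselines.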
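For readers unfamiliar with how a predicted power spectrum can act as a covariance prior, the textbook linear MMSE estimator under sparse pilots can be written as below. This is the standard form, not necessarily the exact estimator used in the paper. The symbols are assumptions introduced here: $S$ selects the pilot positions, $F$ maps delay-beam bins to the full band and array, and $p$ is the predicted delay-beam power spectrum with bins assumed uncorrelated.

```latex
% Observation model: y are the received pilots, h is the full CFR,
% n is additive noise with variance \sigma^2 per pilot.
y = S h + n, \qquad n \sim \mathcal{CN}(0, \sigma^{2} I)

% Covariance prior built from the predicted power spectrum p
% (assumption: delay-beam bins are mutually uncorrelated).
R_h = \mathbb{E}\left[ h h^{H} \right] = F \,\mathrm{diag}(p)\, F^{H}

% Linear MMSE reconstruction of the full CFR from the sparse pilots.
\hat{h} = R_h S^{H} \left( S R_h S^{H} + \sigma^{2} I \right)^{-1} y
```

The prior matters most when pilots are sparse: $R_h$ determines how the few observed entries are interpolated to the unobserved subcarriers and antennas.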

Centralized vs Decentralized Federated Learning: A trade-off performance analysis

2605.16089v1 by Chaimaa Medjadji, Guilain Leduc, Sylvain Kubler, Yves Le Traon

Federated Learning (FL) has emerged as a promising paradigm for collaborative model training across distributed edge devices while preserving data privacy, especially given the sharp growth in data volume driven by the growing number of IoT devices. Storing this much data centrally is challenging because of limited communication, privacy concerns, and regulations. FL can be Centralized (CFL), Decentralized (DFL), or Semi-decentralized (SDFL). Choosing the right FL architecture depends on the application's needs. However, very few studies have experimentally compared these three architectures to understand not only their respective strengths and limitations but also the trade-offs between different performance indicators. This paper addresses this gap with experimental analyses using the Fedstellar simulator, the MNIST dataset, and an MLP classifier.

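Summary: Federated learning (FL) has become a promising paradigm for collaborative model training across distributed edge devices while preserving data privacy, particularly as the adoption of new technologies and the growing number of IoT devices drive a large increase in data volume. Storing this much data centrally is challenging because of communication limits, privacy, and regulations. FL can be centralized (CFL), decentralized (DFL), or semi-decentralized (SDFL). Choosing the right FL architecture depends on the application's needs. However, only a few studies have experimentally compared these three architectures to understand their respective strengths and limitations as well as the trade-offs between different performance indicators. This paper addresses this lack of analysis with experiments using the Fedstellar simulator, the MNIST dataset, and an MLP classifier.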
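To make the centralized versus decentralized distinction concrete, here is a minimal sketch of the two aggregation patterns on toy weight vectors. It does not use Fedstellar; `fedavg`, `gossip_round`, and the ring topology are illustrative assumptions, and real systems interleave these aggregation steps with local training.

```python
def fedavg(client_weights, n_samples):
    """Centralized FL: a server averages client weights, weighted by sample count."""
    total = sum(n_samples)
    dim = len(client_weights[0])
    return [
        sum(w[k] * n for w, n in zip(client_weights, n_samples)) / total
        for k in range(dim)
    ]


def gossip_round(node_weights, neighbors):
    """Decentralized FL: each node averages its weights with its direct neighbors."""
    new = []
    for i, w in enumerate(node_weights):
        group = [w] + [node_weights[j] for j in neighbors[i]]
        new.append([sum(g[k] for g in group) / len(group) for k in range(len(w))])
    return new


weights = [[1.0], [2.0], [3.0], [6.0]]
central = fedavg(weights, n_samples=[10, 10, 10, 10])

# Ring topology: every node talks only to its two adjacent nodes.
ring = {0: [3, 1], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
nodes = weights
for _ in range(50):
    nodes = gossip_round(nodes, ring)
```

In this toy setting the ring eventually agrees on the same average the server computes in one step, at the price of many more communication rounds; that kind of trade-off between indicators is what the paper sets out to measure.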

Towards Trustworthy and Explainable AI for Perception Models: From Concept to Prototype Vehicle Deployment

2605.16087v1 by Till Beemelmanns, Shayan Sharifi, Manas Mehrotra, Ayushman Choudhuri, Lutz Eckstein

Deep Neural Networks have become the dominant solution for Autonomous Driving perception, but their opacity conflicts with emerging Trustworthy AI guidelines and complicates safety assurance, debugging, and human oversight. While theoretical frameworks for safe and Explainable AI (XAI) exist, concrete implementations of Trustworthy AI for 3D scene understanding remain scarce. We address this gap by proposing a Trustworthy AI perception module that is remarkably robust and integrates faithful explainability with calibrated uncertainty estimates. Building on a transformer-based detector, we derive explanations from the attention mechanism at inference time and validate their faithfulness using perturbation-based consistency tests. We further integrate an uncertainty estimation and calibration module, and apply robustness-enhancing training methods. Experiments show faithful saliency behavior, improved robustness, and well-calibrated uncertainty estimates. Finally, we deploy these Trustworthy AI elements in a prototype vehicle and provide an XAI Interface that visualizes documentation artifacts, model uncertainty state, and saliency maps, demonstrating the feasibility of trustworthy perception monitoring in real time. Supplementary materials are available at https://tillbeemelmanns.github.io/trustworthy_ai/ .

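Summary: Deep neural networks have become the dominant solution for autonomous driving perception, but their opacity conflicts with emerging trustworthy AI guidelines and complicates safety assurance, debugging, and human oversight. Although theoretical frameworks for safe and explainable AI (XAI) exist, concrete implementations of trustworthy AI for 3D scene understanding remain scarce. We fill this gap by proposing a trustworthy AI perception module that is remarkably robust and integrates faithful explainability with calibrated uncertainty estimates. Building on a transformer-based detector, we derive explanations from the attention mechanism at inference time and validate their faithfulness with perturbation-based consistency tests. We further integrate an uncertainty estimation and calibration module and apply robustness-enhancing training methods. Experiments show faithful saliency behavior, improved robustness, and well-calibrated uncertainty estimates. Finally, we deploy these trustworthy AI elements in a prototype vehicle and provide an XAI interface that visualizes documentation artifacts, the model's uncertainty state, and saliency maps, demonstrating that trustworthy perception monitoring is feasible in real time. Supplementary materials are available at https://tillbeemelmanns.github.io/trustworthy_ai/ .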

Towards Foundation Models for Relational Databases with Language Models and Graph Neural Networks

2605.16085v1 by Jingcheng Wu, Ratan Bahadur Thapa, Mojtaba Nayyeri, Lucas Etteldorf, Max Finkenbeiner, Fabian Leeske, Steffen Staab

Relational databases store much of the world's structured information, and they are essential for driving complex predictive applications. However, deep learning progress on relational data remains limited, as conventional approaches flatten databases into single tables via manual feature engineering, discarding relational context. Relational deep learning (RDL) addresses this by modeling databases as relational entity graphs (REGs) for graph neural networks (GNNs), but remains task- and database-specific. To combine the strengths of both paradigms, we propose a hybrid architecture combining a fine-tuned BART encoder to capture intra-row semantics with a GraphSAGE-based GNN over REGs to inject relational context. Experiments on RelBench show that the GNN substantially enriches BART's row embeddings, achieving a ROC-AUC of 67.40 on the driver-dnf task from the rel-f1 dataset. This performance is competitive with supervised baselines such as LightGBM (68.86) and narrows the gap to RDL (72.62) to within 5.22 points, though a substantial gap remains to state-of-the-art foundation models such as KumoRFM (82.63). These results suggest that lightweight hybrid LM-GNN architectures offer a promising and resource-efficient path towards foundation models for relational databases.

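Summary: Relational databases store much of the world's structured information and are essential for driving complex predictive applications. However, deep learning progress on relational data remains limited, because conventional approaches flatten databases into single tables through manual feature engineering and discard relational context. Relational deep learning (RDL) addresses this by modeling databases as relational entity graphs (REGs) for graph neural networks (GNNs), but it remains task- and database-specific. To combine the strengths of both paradigms, we propose a hybrid architecture that pairs a fine-tuned BART encoder, which captures intra-row semantics, with a GraphSAGE-based GNN over REGs, which injects relational context. Experiments on RelBench show that the GNN substantially enriches BART's row embeddings, reaching a ROC-AUC of 67.40 on the driver-dnf task from the rel-f1 dataset. This performance is competitive with supervised baselines such as LightGBM (68.86) and narrows the gap to RDL (72.62) to 5.22 points, although a substantial gap remains to state-of-the-art foundation models such as KumoRFM (82.63). These results suggest that lightweight hybrid LM-GNN architectures offer a promising and resource-efficient path towards foundation models for relational databases.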

VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

2605.16079v1 by Yiming Zhao, Yu Zeng, Wenxuan Huang, Zhen Fang, Qing Miao, Qisheng Su, Jiawei Zhao, Jiayin Cai, Lin Chen, Zehui Chen, Yukun Qi, Yao Hu, Xiaolong Jiang, Feng Zhao

Large Vision-Language Models (LVLMs) have shown significant progress in video understanding, yet they face substantial challenges in tasks requiring precise spatiotemporal localization at the instance level. Existing methods primarily rely on text prompts for human-model interaction, but these prompts struggle to provide precise spatial and temporal references, resulting in poor user experience. Furthermore, current approaches typically decouple visual perception from language reasoning, centering reasoning around language rather than visual content, which limits the model's ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a novel paradigm for instance-level video understanding through visual prompts. VideoSeeker seamlessly integrates agentic reasoning with instance-level video understanding tasks, enabling the model to proactively perceive and retrieve relevant video segments on demand. We construct a four-stage fully automated data synthesis pipeline to efficiently generate large-scale, high-quality instance-level video data. We internalize tool-calling and proactive perception capabilities into the model via cold-start supervision and RL training, building a powerful video understanding model. Experiments demonstrate that our model achieves an average improvement of +13.7% over baselines on instance-level video understanding tasks, surpassing powerful closed-source models such as GPT-4o and Gemini-2.5-Pro, while also showing effective transferability on general video understanding benchmarks. The relevant datasets and code will be released publicly.

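Summary: Large vision-language models (LVLMs) have made significant progress in video understanding, yet they still face substantial challenges in instance-level tasks that require precise spatiotemporal localization. Existing methods rely mainly on text prompts for human-model interaction, but such prompts struggle to provide precise spatial and temporal references, which leads to a poor user experience. In addition, current approaches typically decouple visual perception from language reasoning and center reasoning on language rather than visual content, which limits the model's ability to proactively perceive fine-grained visual evidence. To address these challenges, we propose VideoSeeker, a new paradigm for instance-level video understanding through visual prompts. VideoSeeker seamlessly integrates agentic reasoning with instance-level video understanding tasks, enabling the model to proactively perceive and retrieve relevant video segments on demand. We construct a four-stage, fully automated data synthesis pipeline to efficiently generate large-scale, high-quality instance-level video data. Through cold-start supervision and RL training, we internalize tool-calling and proactive perception capabilities into the model, building a powerful video understanding model. Experiments show that our model achieves an average improvement of +13.7% over baselines on instance-level video understanding tasks, surpassing powerful closed-source models such as GPT-4o and Gemini-2.5-Pro, while also showing effective transferability on general video understanding benchmarks. The relevant datasets and code will be released publicly.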

Can Large Language Models Imitate Human Speech for Clinical Assessment? LLM-Driven Data Augmentation for Cognitive Score Prediction

2605.16077v1 by Si-Belkacem Yamine Ketir, Lenard Paulo Tamayo, Shohei Hisada, Shaowen Peng, Shoko Wakamiya, Eiji Aramaki

Accurate assessment of cognitive decline from spontaneous speech remains challenging due to limited dataset size and class imbalance. In this work, we propose a large language model (LLM)-driven data augmentation framework to improve the prediction of cognitive scores from speech. Experiments are conducted on a Japanese corpus in which each participant provides both a spontaneous oral narrative and a written response to the same clinical prompt. The written responses serve as semantic anchors to generate multiple oral-like monologues in different styles using GPT-5. We then predict Hasegawa Dementia Scale scores, a widely used cognitive screening tool in Japan, using a Partial Least Squares regression model trained on Sentence-BERT speech embeddings. We investigate two augmentation strategies: random class-balanced selection, which yields moderate but unstable improvements, and similarity-guided class-balanced selection. The latter prioritizes semantically close synthetic samples, leading to more consistent improvements and substantially reducing prediction error for minority low-score participants while maintaining performance for the majority group. Overall, our findings demonstrate the potential of semantically guided LLM-driven augmentation as a principled approach for addressing class imbalance and improving data efficiency in clinical speech analysis.

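Summary: Accurately assessing cognitive decline from spontaneous speech remains challenging because of limited dataset size and class imbalance. In this work, we propose a large language model (LLM)-driven data augmentation framework to improve the prediction of cognitive scores from speech. Experiments are conducted on a Japanese corpus in which each participant provides both a spontaneous oral narrative and a written response to the same clinical prompt. The written responses serve as semantic anchors for generating multiple oral-like monologues in different styles with GPT-5. We then predict Hasegawa Dementia Scale scores, a cognitive screening tool widely used in Japan, with a Partial Least Squares regression model trained on Sentence-BERT speech embeddings. We investigate two augmentation strategies: random class-balanced selection, which yields moderate but unstable improvements, and similarity-guided class-balanced selection. The latter prioritizes semantically close synthetic samples, leading to more consistent improvements and substantially reducing prediction error for minority low-score participants while maintaining performance for the majority group. Overall, our findings show the potential of semantically guided LLM-driven augmentation as a principled approach to addressing class imbalance and improving data efficiency in clinical speech analysis.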

AgriMind: An Ensemble Deep Learning Framework for Multi-Class Plant Disease Classification

2605.16076v1 by Salma Hoque Talukdar Koli, Fahima Haque Talukder Jely

Plant disease detection is still largely manual in Bangladesh, where extension workers eyeball leaf samples across millions of smallholdings. We built AgriMind to automate this: an ensemble of ResNet50, EfficientNet-B0, and DenseNet121 trained on 20,638 PlantVillage images across 15 pepper, potato, and tomato disease classes. Transfer learning with frozen ImageNet backbones and 10 epochs of head-only training keeps the pipeline lightweight. Individual models hit 96--97% on the held-out test set, but averaging their softmax outputs pushes the ensemble to 99.23% -- a two-thirds cut in error rate. We tried biasing the average toward the best validation model; it backfired. Dropping any single model also hurt. Pepper and potato classify perfectly; tomato, with ten visually similar classes, still reaches 99.01%. On an NVIDIA T4 GPU the full ensemble runs at 53 FPS. Whether that translates to real-time mobile use depends on TensorFlow Lite optimization -- work we have not yet completed.

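Summary: Plant disease detection in Bangladesh is still largely manual, with extension workers inspecting leaf samples by eye across millions of smallholdings. We built AgriMind to automate this: an ensemble of ResNet50, EfficientNet-B0, and DenseNet121 trained on 20,638 PlantVillage images covering 15 pepper, potato, and tomato disease classes. Transfer learning with frozen ImageNet backbones and 10 epochs of head-only training keeps the pipeline lightweight. Individual models reach 96--97% on the held-out test set, but averaging their softmax outputs lifts the ensemble to 99.23%, a two-thirds cut in error rate. We tried biasing the average toward the best validation model; it backfired. Dropping any single model also hurt. Pepper and potato are classified perfectly; tomato, with ten visually similar classes, still reaches 99.01%. On an NVIDIA T4 GPU the full ensemble runs at 53 FPS. Whether that translates to real-time mobile use depends on TensorFlow Lite optimization, work we have not yet completed.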
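The softmax-averaging step described above can be sketched in a few lines. This is a generic illustration rather than the authors' code: `ensemble_predict`, the optional `weights` argument, and the toy logits are assumptions, and the three lists merely stand in for the outputs of the three backbones on a single image.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def ensemble_predict(per_model_logits, weights=None):
    """Average per-model class probabilities; return (predicted class, averaged probs)."""
    probs = [softmax(logits) for logits in per_model_logits]
    n = len(probs)
    weights = weights or [1.0 / n] * n  # uniform average unless weights are given
    n_classes = len(probs[0])
    avg = [sum(w * p[c] for w, p in zip(weights, probs)) for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__), avg


# Toy logits for one image and three classes (hypothetical values).
logits = [
    [1.0, 0.9, 0.0],  # model A: narrowly prefers class 0
    [0.0, 3.0, 0.0],  # model B: confidently prefers class 1
    [0.2, 1.0, 0.0],  # model C: prefers class 1
]
label, avg_probs = ensemble_predict(logits)
```

Averaging probabilities rather than taking a hard vote lets a confident model outweigh a hesitant one. Passing non-uniform `weights` reproduces the "bias toward the best validation model" variant that the abstract reports as counterproductive.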

Robust Prior-Guided Segmentation for Editable 3D Gaussian Splatting

2605.16065v1 by Raushan Joshi, Jean-Yves Guillemaut

3D Gaussian Splatting (3D-GS) enables real-time 3D scene reconstruction but lacks robust segmentation for editing tasks such as object removal, extraction, and recoloring. Existing approaches that lift 2D segmentations to the 3D domain suffer from view inconsistencies and coarse masks. In this paper, we propose a novel framework that leverages the Segment Anything Model High Quality (SAM-HQ) to generate accurate 2D masks, addressing the limitations of the standard SAM in boundary fidelity and fine-structure preservation. To achieve robust 3D segmentation of any target object in a given scene, we introduce a prior-guided label reassignment method that assigns labels to 3D Gaussians by enforcing multiview consistency with learned priors. Our approach achieves state-of-the-art segmentation accuracy and enables interactive, real-time object editing while maintaining high visual fidelity. Qualitative results demonstrate superior boundary preservation and practical utility in Virtual Reality (VR) and robotics, advancing 3D scene editing.

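Summary: 3D Gaussian Splatting (3D-GS) enables real-time 3D scene reconstruction but lacks robust segmentation for editing tasks such as object removal, extraction, and recoloring. Existing approaches that lift 2D segmentations to the 3D domain suffer from view inconsistencies and coarse masks. In this paper, we propose a novel framework that uses the Segment Anything Model High Quality (SAM-HQ) to generate accurate 2D masks, addressing the limitations of the standard SAM in boundary fidelity and fine-structure preservation. To achieve robust 3D segmentation of any target object in a given scene, we introduce a prior-guided label reassignment method that assigns labels to 3D Gaussians by enforcing multiview consistency with learned priors. Our approach achieves state-of-the-art segmentation accuracy and enables interactive, real-time object editing while maintaining high visual fidelity. Qualitative results demonstrate superior boundary preservation and practical utility in virtual reality (VR) and robotics, advancing 3D scene editing.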

Misspecified Explore-then-Exploit Leads to Supra-Competitive Prices

2605.16064v1 by Jackie Baek, Vivek F. Farias, Farrell Wu

We study whether simple algorithmic pricing systems can systematically produce collusive-like prices in multi-firm markets. We consider firms using an explore-then-exploit pipeline: they randomize prices during an initial exploration phase, then estimate demand from their own historical data and set prices myopically thereafter. The estimation step relies on a misspecified, monopoly-style model that omits competitors' prices. We characterize when this pipeline converges to supra-competitive prices above the Nash equilibrium, via a fluid-limit ordinary differential equation analysis. We show that supra-competitive prices arise when firms explore within similar price ranges on the same side of the Nash price. Moreover, prices can be substantially above the Nash price; we show that prices can reach monopoly levels under symmetric exploration. Simulations calibrated to a real multifamily rental market confirm that supra-competitive outcomes arise robustly beyond our theoretical assumptions, including under finite horizons, heterogeneous products, and nonlinear logit demand.

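Summary: We study whether simple algorithmic pricing systems can systematically produce collusive-like prices in multi-firm markets. We consider firms that use an explore-then-exploit pipeline: they randomize prices during an initial exploration phase, then estimate demand from their own historical data and set prices myopically thereafter. The estimation step relies on a misspecified, monopoly-style model that omits competitors' prices. Using a fluid-limit ordinary differential equation analysis, we characterize when this pipeline converges to supra-competitive prices above the Nash equilibrium. We show that supra-competitive prices arise when firms explore within similar price ranges on the same side of the Nash price. Moreover, prices can be substantially above the Nash price; we show that they can reach monopoly levels under symmetric exploration. Simulations calibrated to a real multifamily rental market confirm that supra-competitive outcomes arise robustly beyond our theoretical assumptions, including under finite horizons, heterogeneous products, and nonlinear logit demand.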
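A toy illustration of the mechanism, under assumptions that are not taken from the paper: two symmetric firms with zero cost, noise-free linear demand d_i = a - b p_i + c p_j, independent uniform exploration, and a single myopic pricing step after exploration. It is a one-shot sketch rather than the fluid-limit analysis, and all names and parameter values are hypothetical.

```python
import random


def ols(xs, ys):
    """Least-squares fit y = intercept + slope * x; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def explore_then_exploit(a, b, c, lo, hi, n_explore=5000, seed=0):
    """Firm 1's price after exploring, using a demand model that ignores firm 2."""
    rng = random.Random(seed)
    p1 = [rng.uniform(lo, hi) for _ in range(n_explore)]
    p2 = [rng.uniform(lo, hi) for _ in range(n_explore)]
    d1 = [a - b * x + c * y for x, y in zip(p1, p2)]  # true demand depends on both prices
    alpha, slope = ols(p1, d1)  # misspecified fit: competitor's price is omitted
    return -alpha / (2.0 * slope)  # maximizes p * (alpha + slope * p)


a, b, c = 10.0, 2.0, 1.0
nash = a / (2 * b - c)            # symmetric Nash price for this linear model
monopoly = a / (2 * (b - c))      # joint-profit-maximizing price
price = explore_then_exploit(a, b, c, lo=4.0, hi=6.0)
```

Because the competitor's price is omitted, its average effect is absorbed into the estimated intercept. When both firms explore above the Nash price, that inflated intercept pushes the myopic price above Nash, in line with the "same side of the Nash price" condition in the abstract.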

Ada-Diffuser: Latent-Aware Adaptive Diffusion for Decision-Making

2605.16054v1 by Fan Feng, Selena Ge, Minghao Fu, Zijian Li, Yujia Zheng, Zeyu Tang, Yingyao Hu, Biwei Huang, Kun Zhang

Recent work has framed decision-making as a sequence modeling problem using generative models such as diffusion models. Although promising, these approaches often overlook latent factors that exhibit evolving dynamics, elements that are fundamental to environment transitions, reward structures, and high-level agent behavior. Explicitly modeling these hidden processes is essential for both precise dynamics modeling and effective decision-making. In this paper, we propose a unified framework that explicitly incorporates latent dynamic inference into generative decision-making from minimal yet sufficient observations. We theoretically show that under mild conditions, the latent process can be identified from small temporal blocks of observations. Building on this insight, we introduce Ada-Diffuser, a causal diffusion model that learns the temporal structure of observed interactions and the underlying latent dynamics simultaneously, and furthermore, leverages them for planning and control. With a modular design, Ada-Diffuser supports both planning and policy learning tasks, enabling adaptation to latent variations in dynamics, rewards, and latent actions. Experiments on simulated control and robotic benchmarks demonstrate its effectiveness in accurate latent inference and adaptive policy learning.

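Summary: Recent work has framed decision-making as a sequence modeling problem using generative models such as diffusion models. Although promising, these approaches often overlook latent factors with evolving dynamics, which are fundamental to environment transitions, reward structures, and high-level agent behavior. Explicitly modeling these hidden processes is essential for both precise dynamics modeling and effective decision-making. In this paper, we propose a unified framework that explicitly incorporates latent dynamic inference into generative decision-making from minimal yet sufficient observations. We show theoretically that, under mild conditions, the latent process can be identified from small temporal blocks of observations. Building on this insight, we introduce Ada-Diffuser, a causal diffusion model that simultaneously learns the temporal structure of observed interactions and the underlying latent dynamics, and then uses them for planning and control. With a modular design, Ada-Diffuser supports both planning and policy learning tasks and can adapt to latent variations in dynamics, rewards, and latent actions. Experiments on simulated control and robotic benchmarks demonstrate its effectiveness in accurate latent inference and adaptive policy learning.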

Reasoners or Translators? Contamination-aware Evaluation and Neuro-Symbolic Robustness in Tax Law

2605.16052v1 by Parisa Kordjamshidi, Samer Aslan, Madhavan Seshadri, Leslie Barrett, Enrico Santus

Recent advances in large language models (LLMs) have significantly enhanced automated legal reasoning. Yet, it remains unclear whether their performance reflects genuine legal reasoning ability or artifacts of data contamination. We present a comprehensive empirical study of tax law reasoning approaches and implement a contamination detection protocol to rigorously assess LLM reliability. We show that performance can be inflated by contamination. Building on this analysis, we conduct a systematic evaluation, comparing monolithic LLMs with hybrid systems that translate statutory text into formal representations and delegate inference to symbolic solvers. We build a novel test suite designed to probe generalization to unseen documents via case and rule variations. Our findings indicate that legal reasoning is inherently compositional and that neuro-symbolic frameworks offer a more reliable and robust foundation for legal AI, as well as improved generalization to unobserved situations.

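Summary: Recent advances in large language models (LLMs) have significantly enhanced automated legal reasoning. Yet it remains unclear whether their performance reflects genuine legal reasoning ability or artifacts of data contamination. We present a comprehensive empirical study of tax law reasoning approaches and implement a contamination detection protocol to rigorously assess LLM reliability. We show that performance can be inflated by contamination. Building on this analysis, we conduct a systematic evaluation that compares monolithic LLMs with hybrid systems that translate statutory text into formal representations and delegate inference to symbolic solvers. We build a novel test suite designed to probe generalization to unseen documents through case and rule variations. Our findings indicate that legal reasoning is inherently compositional and that neuro-symbolic frameworks offer a more reliable and robust foundation for legal AI, along with improved generalization to unobserved situations.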

Looped SSMs: Depth-Recurrence and Input Reshaping for Time Series Classification

2605.16048v1 by Mónika Farsang, Ramin Hasani, Daniela Rus, Radu Grosu

State Space Models (SSMs) are inherently recurrent along the sequence dimension, yet depth-recurrence - reusing the same block repeatedly across layers, as recently applied in looped transformers - has not been explored in this model family. We show that a looped SSM with $k$ parameters iterated $L$ times consistently closely matches or outperforms a standard SSM with $k \cdot L$ independent parameters across four architectures (LRU, S5, LinOSS, LrcSSM) and six time series classification benchmarks, despite operating within a strictly smaller hypothesis space, as we formally establish. Since the larger model contains the looped model as a special case, this dominance cannot be explained by expressivity and instead points to parameter sharing across depth as a beneficial inductive bias that simplifies optimization. These results demonstrate that depth-recurrence is orthogonal to sequence-recurrence and independently beneficial. We further show that input reshaping is an equally neglected design axis: concatenating timesteps for low-dimensional inputs, or flattening and rechunking the joint feature-time dimension for high-dimensional ones, yields accuracy gains of 1-6% across all models, confirmed over 5 random seeds. Both techniques provide standalone improvements that compound when combined, suggesting that depth and input reshaping are two independent and underexplored design axes for SSMs on time series.

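Summary: State space models (SSMs) are inherently recurrent along the sequence dimension, yet depth-recurrence, that is, reusing the same block repeatedly across layers as recently done in looped transformers, has not been explored in this model family. We show that a looped SSM with $k$ parameters iterated $L$ times consistently closely matches or outperforms a standard SSM with $k \cdot L$ independent parameters across four architectures (LRU, S5, LinOSS, LrcSSM) and six time series classification benchmarks, despite operating within a strictly smaller hypothesis space, as we formally establish. Because the larger model contains the looped model as a special case, this dominance cannot be explained by expressivity and instead points to parameter sharing across depth as a beneficial inductive bias that simplifies optimization. These results show that depth-recurrence is orthogonal to sequence-recurrence and independently beneficial. We further show that input reshaping is an equally neglected design axis: concatenating timesteps for low-dimensional inputs, or flattening and rechunking the joint feature-time dimension for high-dimensional ones, yields accuracy gains of 1-6% across all models, confirmed over 5 random seeds. Both techniques provide standalone improvements that compound when combined, suggesting that depth and input reshaping are two independent and underexplored design axes for SSMs on time series.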
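The two design axes can be illustrated with a deliberately tiny sketch. The single-channel linear block below is a hypothetical stand-in, not LRU, S5, LinOSS, or LrcSSM; it only shows that a looped model is a stacked model with tied parameters, and what concatenating timesteps does to the input shape.

```python
def ssm_block(seq, params):
    """Toy linear SSM: h_t = a * h_{t-1} + b * x_t, output y_t = c * h_t + x_t."""
    a, b, c = params
    h, out = 0.0, []
    for x in seq:
        h = a * h + b * x          # recurrence along the sequence dimension
        out.append(c * h + x)      # readout with a residual connection
    return out


def stacked(seq, layer_params):
    """Standard deep model: every layer has its own parameters (k * L in total)."""
    for p in layer_params:
        seq = ssm_block(seq, p)
    return seq


def looped(seq, shared_params, n_loops):
    """Depth-recurrent model: one parameter set (k) reused n_loops times."""
    return stacked(seq, [shared_params] * n_loops)


def concat_timesteps(seq, group):
    """Reshape (T, d) into (T // group, d * group) by joining consecutive timesteps."""
    usable = len(seq) - len(seq) % group
    return [
        sum((list(seq[i + j]) for j in range(group)), [])
        for i in range(0, usable, group)
    ]


x = [1.0, 0.0, -1.0, 2.0]
shared = (0.5, 1.0, 0.3)
y_looped = looped(x, shared, n_loops=3)
reshaped = concat_timesteps([[1], [2], [3], [4], [5]], group=2)
```

The looped model is exactly the stacked model restricted to identical per-layer parameters, which is why its hypothesis space is contained in that of the larger model. `concat_timesteps` trades sequence length for feature width and drops any trailing timesteps that do not fill a complete group.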

XSearch: Explainable Code Search via Concept-to-Code Alignment

2605.16046v1 by Yiming Liu, Ruofan Liu, Yun Lin, Zicong Zhang, Weiyu Kong, Pengnian Qi, Xiao Cheng, Weinan Zhang, Qianxiang Wang, Linpeng Huang

Semantic code search has been widely adopted in both academia and industry. These approaches embed natural-language queries and code snippets into a shared embedding space and retrieve results based on vector similarity. Despite strong performance on benchmark datasets, they often suffer from poor explainability and generalization. Retrieved code may appear semantically similar yet miss critical functional requirements of the query, while providing no explanation of why the result was retrieved. Moreover, such failures become more severe under distribution shift, where models struggle to generalize to unseen benchmarks. In this work, we propose XSearch, an intrinsically explainable code search framework. Our key insight is that by relying on global embedding similarity, existing retrievers inherently take an inductive view. They learn statistical patterns rather than truly understanding the query's functional requirements. We address this problem by reformulating code search as a deductive concept alignment problem. XSearch (i) identifies functional concepts in the query and (ii) explicitly aligns them with corresponding code statements. This explain-then-predict design produces inherent concept-level explanations and mitigates shortcut learning that harms out-of-distribution generalization. We train an encoder with explicit concept-alignment objectives and perform retrieval through explicit matching between query concepts and code statements. Experiments show that, trained on CodeSearchNet using GraphCodeBERT (125M parameters), XSearch improves performance on out-of-distribution benchmarks from 0.02 to 0.33 (15x) over eight state-of-the-art retrievers, and consistently outperforms both encoder- and decoder-based baselines with up to 7B parameters. A user study demonstrates that concept-alignment explanations enable users to evaluate retrieved results faster and more accurately.

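Summary: Semantic code search has been widely adopted in both academia and industry. These approaches embed natural-language queries and code snippets into a shared embedding space and retrieve results by vector similarity. Despite strong performance on benchmark datasets, they often suffer from poor explainability and generalization. Retrieved code may look semantically similar yet miss critical functional requirements of the query, and no explanation is given for why the result was retrieved. These failures become more severe under distribution shift, where models struggle to generalize to unseen benchmarks. In this work, we propose XSearch, an intrinsically explainable code search framework. Our key insight is that, by relying on global embedding similarity, existing retrievers inherently take an inductive view: they learn statistical patterns rather than truly understanding the query's functional requirements. We address this by reformulating code search as a deductive concept alignment problem. XSearch (i) identifies functional concepts in the query and (ii) explicitly aligns them with the corresponding code statements. This explain-then-predict design produces inherent concept-level explanations and mitigates the shortcut learning that harms out-of-distribution generalization. We train an encoder with explicit concept-alignment objectives and perform retrieval through explicit matching between query concepts and code statements. Experiments show that, trained on CodeSearchNet using GraphCodeBERT (125M parameters), XSearch improves performance on out-of-distribution benchmarks from 0.02 to 0.33 (15x) over eight state-of-the-art retrievers and consistently outperforms both encoder- and decoder-based baselines with up to 7B parameters. A user study shows that concept-alignment explanations let users evaluate retrieved results faster and more accurately.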
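The idea of scoring a snippet by aligning each query concept to a code statement can be sketched as follows. XSearch uses a trained encoder for the matching; the token-overlap `similarity` below is only a hypothetical stand-in so the example runs without a model, and the concept lists are assumed to be given rather than extracted.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens of a concept or a code statement."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(concept, statement):
    """Stand-in matcher: Jaccard overlap between token sets."""
    a, b = tokens(concept), tokens(statement)
    return len(a & b) / len(a | b) if a | b else 0.0


def align(concepts, statements):
    """Align every concept to its best statement; score is the mean best match."""
    alignment = {}
    for concept in concepts:
        best = max(statements, key=lambda s: similarity(concept, s))
        alignment[concept] = (best, similarity(concept, best))
    score = sum(s for _, s in alignment.values()) / len(concepts)
    return score, alignment


concepts = ["read file", "sort lines"]
snippet_a = ["data = open(path).read()", "lines = data.splitlines()", "lines.sort()"]
snippet_b = ["print('hello')"]

score_a, explanation_a = align(concepts, snippet_a)
score_b, _ = align(concepts, snippet_b)
```

The returned `alignment` doubles as the explanation: it shows which statement was taken to satisfy each concept, and a concept with a low best-match score flags a functional requirement the snippet probably misses.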

RecMem: Recurrence-based Memory Consolidation for Efficient and Effective Long-Running LLM Agents

2605.16045v1 by Zijie Dai, Shiyuan Deng, Sheng Guan, Yizhou Tian, Xin Yao, Xiao Yan, James Cheng

Memory systems often organize user-agent interactions as retrievable external memory and are crucial for long-running agents because they overcome the limited context windows of LLMs. However, existing memory systems invoke LLMs to process every incoming interaction for memory extraction, and such an eager memory consolidation scheme leads to substantial token consumption. To tackle this problem, we propose RecMem by rethinking when memory consolidation should be conducted. RecMem stores incoming interactions in a subconscious memory layer and encodes them using lightweight embedding models for retrieval. LLMs are only invoked to extract episodic and semantic memory when sustained recurrence is observed for semantically similar interactions. Such recurrence-based consolidation works because these interactions correspond to a semantic cluster with rich information and thus are worth extraction and summarization. To improve accuracy, RecMem also incorporates a semantic refinement mechanism that recovers the fine-grained facts omitted by memory extraction. Experiments show that RecMem reduces the memory construction token cost of three SOTA memory systems by up to 87% while exceeding their accuracy.

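Summary: Memory systems typically organize user-agent interactions as retrievable external memory and are crucial for long-running agents because they overcome the limited context windows of large language models (LLMs). However, existing memory systems invoke LLMs to process every incoming interaction for memory extraction, and this eager consolidation scheme consumes a large number of tokens. To address this, we propose RecMem, which rethinks when memory consolidation should take place. RecMem stores incoming interactions in a subconscious memory layer and encodes them with lightweight embedding models for retrieval. LLMs are invoked to extract episodic and semantic memory only when sustained recurrence is observed among semantically similar interactions. This recurrence-based consolidation works because such interactions correspond to an information-rich semantic cluster and are therefore worth extracting and summarizing. To improve accuracy, RecMem also incorporates a semantic refinement mechanism that recovers fine-grained facts omitted by memory extraction. Experiments show that RecMem reduces the memory construction token cost of three state-of-the-art memory systems by up to 87% while exceeding their accuracy.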
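A minimal sketch of lazy, recurrence-triggered consolidation, assuming embeddings are supplied by the caller. The class name, the thresholds, and the `consolidate` callback (a placeholder for the LLM extraction call) are hypothetical; the paper's actual recurrence criterion and its semantic refinement step are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


class RecurrenceMemory:
    def __init__(self, consolidate, sim_threshold=0.8, min_recurrence=3):
        self.consolidate = consolidate      # expensive step (e.g. an LLM call)
        self.sim_threshold = sim_threshold
        self.min_recurrence = min_recurrence
        self.buffer = []                    # cheap layer: (embedding, text) pairs
        self.consolidated = []              # results of the expensive step

    def add(self, embedding, text):
        """Buffer an interaction; consolidate only if its cluster is large enough."""
        self.buffer.append((embedding, text))
        similar = [
            (e, t) for e, t in self.buffer
            if cosine(e, embedding) >= self.sim_threshold
        ]
        if len(similar) < self.min_recurrence:
            return False
        self.consolidated.append(self.consolidate([t for _, t in similar]))
        # Keep only the interactions that were not part of the consolidated cluster.
        self.buffer = [
            (e, t) for e, t in self.buffer
            if cosine(e, embedding) < self.sim_threshold
        ]
        return True


calls = []
memory = RecurrenceMemory(consolidate=lambda texts: calls.append(texts) or " | ".join(texts))
results = [
    memory.add([1.0, 0.0], "likes green tea"),
    memory.add([0.0, 1.0], "asked about flights"),
    memory.add([0.9, 0.1], "ordered green tea"),
    memory.add([1.0, 0.05], "prefers green tea"),
]
```

Only the fourth interaction triggers the expensive callback, because it is the third member of its similarity cluster; the unrelated interaction stays in the buffer and costs nothing beyond its embedding.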

Who Owns This Agent? Tracing AI Agents Back to Their Owners

2605.16035v1 by Ruben Chocron, Doron Jonathan Ben Chayim, Eyal Lenga, Gilad Gressel, Alina Oprea, Yisroel Mirsky

AI agents are increasingly deployed to act autonomously in the world, yet there is still no reliable way to trace a harmful agent back to the account that deployed it. This creates the same accountability gap across both ends of the intent spectrum: benign operators may deploy misconfigured or overbroad agents that cause harm unintentionally, while malicious operators may deliberately weaponize agents for scams, harassment, or cyber attacks. In many cases, these agents are powered by vendor-hosted models, a dependency that holds even for sophisticated adversaries such as state actors conducting cyber operations. In either case, affected parties can observe the behavior but cannot notify the responsible operator, stop the session, or identify the account for investigation. We formalize this gap as the problem of agent attribution: linking an observed agent interaction to the responsible account at the hosting vendor. To our knowledge, this is the first work to define the problem and present a practical solution. Our protocol is canary-based: an authorized party injects a canary into the agent's interaction stream, and the vendor searches a narrow window of session logs to recover the originating session and account. Simple canaries suffice in non-adversarial settings. For adversarial operators who filter or paraphrase incoming content, we develop robust canary constructions that cannot be suppressed without degrading the agent's own task performance, yielding a formal asymmetry in the defender's favor. We evaluate a variety of scenarios including real-world agents and show that our attribution method is reliable, robust, and scalable for vendor-side deployment.

From Flat Language Labels to Typological Priors: Structured Language Conditioning for Multilingual Speech-to-Speech Translation

2605.16026v1 by Yu Pan, Yang Hou, Xiongfei Wu, Liang Zhang, Yves Le Traon, Lei Ma, Jianjun Zhao

Compositional speech-to-speech translation (S2ST) systems built upon speech large language models (SpeechLLMs) have recently shown promising performance. However, existing S2ST systems often either neglect source-language information or encode it through a language-as-label paradigm, representing each source language as an independent flat embedding. Such a design overlooks systematic linguistic structure shared across languages, which may limit data-efficient multilingual adaptation when supervised S2ST data are scarce. To address this issue, we propose S2ST-Omni 2, a many-to-one compositional S2ST framework that systematically reformulates multilingual language conditioning from flat language labels to structured typological priors. Specifically, S2ST-Omni 2 revisits language conditioning at three levels: typology-informed hierarchical language encoding for structured source-language representation, dynamically-gated language-aware Dual-CTC for content-adaptive acoustic modulation, and typology-aware LLM prompting for decoder-side linguistic guidance. Experiments on CVSS-C show that S2ST-Omni 2 achieves superior average performance among representative S2ST approaches across BLEU, COMET, ASR-BLEU, and BLASER 2.0 under the adopted evaluation protocol. Ablation studies indicate that the proposed representation-level, acoustic-level, and decoding-level strategies provide complementary benefits. Moreover, controlled data-budget analyses and a Japanese-to-English evaluation using only approximately 3 hours of supervised training data suggest that explicit typological priors provide useful inductive biases for data-efficient multilingual S2ST.

Judge Circuits

2605.16023v1 by Nils Feldhus, Tanja Baeumel, Elena Golimblevskaia, Qianli Wang, Van Bach Nguyen, Aaron Louis Eidt, Christopher Ebert, Wojciech Samek, Jing Yang, Vera Schmitt, Sebastian Möller, Simon Ostermann

LLM-as-a-judge has become the dominant paradigm for grading model outputs at scale, yet the same model assigns systematically different scores when its output format changes (e.g., a 1-5 rating vs. a True/False label). Existing diagnoses of these format-induced inconsistencies stop at the input-output level. Using Position-aware Edge Attribution Patching (PEAP), we causally investigate the internal mechanism in Gemma-3, Qwen2.5, and Llama-3. We find that judgments across structured understanding and open-ended preference tasks share a sparse, generalized Latent Evaluator sub-graph in the mid-to-late multi-layer perceptrons (MLPs); zero-ablating it collapses judgment while preserving world knowledge in architecturally modular models. By structurally decoupling abstract judging from output formatting, we provide a mechanistic account of format-induced inconsistency on the open-weight models we study: a continuous judgment signal computed in the shared trunk is mapped through fragile, format-specific terminal branches, enabling format-independent preference to be isolated downstream of the requested output format. Our findings imply that benchmark-level reliability comparisons across formats are partially measuring formatter geometry rather than evaluation quality.

Can Vision Language Models Be Adaptive in Mathematics Education? A Learner Model-based Rubric Study

2605.16011v1 by Jie Gao, Yongan Yu, Junzhu Su, Yiran Lin, Adam K. Dube, Jackie Chi Kit Cheung

Adaptive learning refers to educational technologies that track learners' progress and adapt the instructional process to each learner's performance. It is increasingly recognized as critical for developing an effective learning support tool. Vision language models (VLMs) have seen adoption in mathematics education, and students have been using them as learning aids for personalized instruction. However, it is unknown whether VLMs can adapt to different learner profiles when providing mathematical instruction, and there is no systematic framework for evaluating this adaptivity in mathematics tutoring tasks. To address this gap, we draw on the learner model from the adaptive learning framework (Shute and Towle, 2018) and propose a learner model-based rubric. Our rubric formalizes adaptivity assessment into three aspects: cognitive aspects, motivational aspects, and complexity. We also evaluate two additional dimensions of VLM responses: correctness (of answers and solutions) and quality (of the response itself). Our experimental results show measurable differences in adaptivity across models and also reveal that current VLMs struggle to consistently produce learner model-based instructional responses, especially when receiving limited learner information.

CitePrism: Human-in-the-Loop AI for Citation Auditing and Editorial Integrity

2605.16000v1 by Gowrika Mahesh, Budanur Madappa Darshan Gowda, Kavana Gopladevarahalli Papegowda, Prajwal Basavaraj, Binh Vu, Swati Chandna, Mehrdad Jalali

Editors and reviewers are expected to ensure that manuscripts cite relevant, accurate, current, and ethically appropriate literature, yet manuscript-level citation auditing remains largely manual, fragmented, and difficult to scale. Citation context, metadata quality, self-citation patterns, and bibliographic integrity all affect whether a reference appropriately supports a local claim. We present CitePrism, a transparent hybrid decision-support framework for editorial citation auditing that combines LLM-assisted contextual reasoning, embedding-based semantic similarity, metadata verification, integrity-oriented flags, and human-in-the-loop analyst review. CitePrism extracts citation neighborhoods, enriches reference metadata, computes fused relevance scores, surfaces metadata and self-citation review prompts, and supports configurable threshold-based triage. In a preliminary validation on a single case-study manuscript with 104 references from pavement engineering, agreement with human binary relevance labels reached Cohen's kappa = 0.429. At operating threshold tau = 17, CitePrism flagged all human-labeled irrelevant citations, while also producing false positives requiring analyst review. These results suggest that CitePrism may support conservative editorial screening and citation-quality triage, but they do not establish general editorial performance. CitePrism is intended as pilot-stage decision support, not as an autonomous misconduct detector or automated editorial decision system. Broader validation across manuscripts, domains, annotators, baselines, and deployment settings is required before operational use.
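
For readers unfamiliar with the agreement statistic quoted above, Cohen's kappa compares observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a generic implementation of that formula, not CitePrism's code; the function name and the example labels are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical labels: 1 = relevant citation, 0 = irrelevant citation.
human = [1, 1, 0, 0]
system = [1, 0, 0, 0]
kappa = cohens_kappa(human, system)
```

Because kappa discounts chance agreement, it is usually lower than raw percent agreement on the same labels.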

Constrained latent state modeling: A unifying perspective on representation learning under competing constraints

2605.15995v1 by Gwenolé Quellec

Learning latent representations from complex data is central to modern machine learning, spanning temporal, multimodal, and partially observed systems. In such settings, representations are better understood as latent states capturing underlying system dynamics, rather than as mere compressed summaries of observations. Yet current approaches remain fragmented, relying on distinct -- and often implicit -- assumptions about what these states should represent. We argue that this fragmentation reflects a more fundamental limitation: latent representations are typically learned from underconstrained objectives that fail to specify the properties that meaningful latent states should satisfy. As a result, multiple representations can satisfy the same objective, leading to ambiguity in their structure and interpretation. While many of the underlying principles have been explored in isolation, their interactions have not been explicitly formalized. In this work, we propose constrained latent state modeling (CLSM) as a unifying perspective. We identify a set of core properties -- predictive sufficiency, minimality, temporal coherence, observation compatibility, invariance to nuisance factors, and structural constraints -- and show that they are intrinsically coupled through fundamental trade-offs. Revisiting major modeling families through this lens, we show that existing approaches can be interpreted as enforcing different subsets of constraints, thereby occupying distinct regions of a common design space. This perspective reframes persistent challenges such as lack of identifiability as consequences of underconstrained formulations, rather than isolated technical limitations. More broadly, CLSM provides a principled framework to make design choices explicit, to analyze trade-offs, and to guide the development of more interpretable, robust, and task-aligned latent state models.

Defining Cultural Capabilities for AI Evaluation: A Taxonomy Grounded in Intercultural Communication Theory

2605.15990v1 by Isar Nejadgholi, Masoud Kianpour, Krishnapriya Vishnubhotla, Maryam Molamohamadi

Tremendous efforts have been put into evaluating the inclusivity and effectiveness of AI systems across cultures. However, the cultural capabilities considered in much of the literature remain vaguely defined, are referred to using interchangeable terminology, and are typically limited to recalling accurate information about various demographics, regions, and nationalities. To address this construct ambiguity, we draw from Intercultural Communication scholarship and propose a three-level taxonomy of AI-relevant cultural capabilities: Cultural Awareness answers "Does the model know?", Cultural Sensitivity answers "How does it frame its knowledge?", and Cultural Competence answers "Can it adapt as the interaction evolves?". Beyond conceptual clarification, we position this taxonomy as a practical tool for improving the validity and interpretability of AI evaluation in real-world, multicultural settings. Without such construct clarity, evaluation results risk overstating model capabilities and may lead to inappropriate deployment decisions in culturally sensitive contexts.

Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues

2605.15984v1 by Zhongjie Ba, Liang Yi, Peng Cheng, Qingcao Li, Qinglong Wang, Li Lu

Toxic speech detection has become a crucial challenge in maintaining safe online communication environments. However, existing approaches to toxic speech detection often neglect the contribution of paralinguistic cues, such as emotion, intonation, and speech rate, which are key to detecting speech toxicity. Moreover, current toxic speech datasets are predominantly text-based, limiting the development of models that can capture paralinguistic cues. To address these challenges, we present ToxiAlert-Bench, a large-scale audio dataset comprising over 30,000 audio clips annotated with seven major toxic categories and twenty fine-grained toxic labels. Uniquely, our dataset annotates toxicity sources -- distinguishing between textual content and paralinguistic origins -- for comprehensive toxic speech analysis. Furthermore, we propose a dual-head neural network with a multi-stage training strategy tailored for toxic speech detection. This architecture features two task-specific classification heads: one for identifying the source of sensitivity (textual or paralinguistic), and the other for categorizing the specific toxic type. The training process involves independent head training followed by joint fine-tuning to reduce task interference. To mitigate class imbalance, we incorporate class-balanced sampling and weighted loss functions. Our experimental results show that leveraging paralinguistic features significantly improves detection performance. Our method consistently outperforms existing baselines across multiple evaluation metrics, with a 21.1% relative improvement in Macro-F1 score and a 13.0% relative gain in accuracy over the strongest baseline, highlighting its enhanced effectiveness and practical applicability.

Ontology for Policing: Conceptual Knowledge Learning for Semantic Understanding and Reasoning in Law Enforcement Reports

2605.15978v1 by Anita Srbinovska, Jansen Orfan, Adrian Martin, Ernest Fokoué

Law enforcement reports contain structured fields and written narratives. However, many incident facts needed for review, police training, and investigations appear only in natural language and require manual reading. We propose a framework that uses symbolic methods to convert narratives into evidence-linked facts. Our objective is to measure the value of narratives by recovering incident details from the unstructured text alone and building temporal graphs with time cues and domain axioms. We achieve this through redaction of personal identifiers, semantic parsing, predicate-to-ontology mapping, and reasoning. We evaluate the symbolic approach on 450 property crime reports and with a short human review. Of the events extracted by the system, 54.1% had a confidence score of at least 0.80 and 93.7% were mapped through the PropBank--VerbNet--WordNet semantic path. The human review reached 100% agreement on incident initiation, stolen items, and temporal cues, and lower agreement on forced-entry interpretation.

Reference-Free Reinforcement Learning Fine-Tuning for MT: A Seq2Seq Perspective

2605.15976v1 by Ernesto Garcia-Estrada, Carlos Escolano, José A. R. Fonallosa

Production machine translation relies overwhelmingly on encoder-decoder Seq2Seq models, yet reinforcement learning approaches to MT fine-tuning have largely targeted decoder-only LLMs at $\geq$7B parameters, with limited systematic study of encoder-decoder architectures. We apply Group Relative Policy Optimization (GRPO) to NLLB-200 (600M and 1.3B) using a hybrid reference-free reward (LaBSE and COMET-Kiwi) that requires no parallel data at fine-tuning time, evaluating across 13 typologically diverse languages. GRPO yields consistent improvements on all 13 languages, up to +5.03 chrF++ for Traditional Chinese, and, without any target-language data, competes with 3-epoch supervised fine-tuning on morphologically complex languages. We identify a consistent empirical pattern in which gains are largest where baseline performance is weakest and reward discriminability is highest, making this approach most effective precisely where parallel data is scarcest, and replicate this pattern across English and Spanish source languages.
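
The group-relative step that gives GRPO its name can be illustrated with a small sketch: several candidate translations are sampled for one source sentence, each is scored, and every reward is normalized against the group's mean and standard deviation. The equal-weight blend in `hybrid_reward` and the use of the population standard deviation are assumptions made for illustration; the abstract does not state how the LaBSE and COMET-Kiwi scores are combined, and the scores here are plain floats rather than calls to either model.

```python
from statistics import mean, pstdev


def hybrid_reward(labse_sim: float, comet_kiwi: float, weight: float = 0.5) -> float:
    """Blend two reference-free scores (the weighting is an assumption)."""
    return weight * labse_sim + (1.0 - weight) * comet_kiwi


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations vary on this
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for three sampled translations of one source sentence.
rewards = [hybrid_reward(0.80, 0.60), hybrid_reward(0.90, 0.70), hybrid_reward(0.70, 0.50)]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed within the group, no learned value function is needed, and a group whose candidates all score the same contributes no gradient signal.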

Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning

2605.15975v1 by Dillon Z. Chen, Till Hofmann, Toryn Q. Klassen, Sheila A. McIlraith

We tackle the challenge of building embodied AI agents that can reliably solve long-horizon planning problems. Imitation learning from demonstrations has shown itself to be effective in training robots to solve a diversity of complex tasks requiring fine motor control and manipulation over low-level (LL), continuous environments. Yet, it remains a difficult endeavour to generate long-horizon plans from imitation learning alone. In contrast, high-level (HL), symbolic abstractions facilitate efficient and interpretable long-horizon planning. We propose to combine the strengths of LL imitation learning for manipulation and control, and HL symbolic abstractions for long-horizon planning. We realise this idea via bilevel policies of the form $(\pi^{\mathrm{hl}}, \pi^{\mathrm{ll}})$, consisting of a neural policy $\pi^{\mathrm{ll}}$ learned from LL demonstrations, and an HL symbolic policy $\pi^{\mathrm{hl}}$ that is constructed from symbolic abstractions of the LL demonstrations combined with inductive generalisation. We implement these ideas in the BISON system. Experiments on extended MetaWorld benchmarks demonstrate that BISON generalises to long horizons and problems with greater numbers of objects than those solved by VLA and end-to-end methods, and is more time and memory efficient in training and inference. Notably, when ignoring LL execution, BISON's HL policies can solve HL problems with 10,000 relevant objects in under a minute. Project page: https://dillonzchen.github.io/bison

Deterministic Event-Graph Substrates as World Models for Counterfactual Reasoning

2605.15967v1 by Fabio Rovai

We study event-graph substrates: a class of world models that represent agent state as an append-only log of typed RDF triples and answer counterfactual queries by forking the log under a structured intervention vocabulary. Substrates are inspectable at the triple level, support exact counterfactuals, and transfer across domains without learned components. We formalize the class, prove a duality between explanatory and counterfactual queries that reduces both to the same causal-ancestor traversal, and evaluate a 1,400-line CLEVRER-DSL interpreter atop a domain-agnostic substrate runtime at full CLEVRER validation scale (n=75,618). The substrate exceeds the NS-DR symbolic oracle on all four per-question categories (by 9.89, 20.26, 17.65, and 0.80 percentage points), and exceeds the parametric ALOE baseline on descriptive and explanatory while lagging on predictive and counterfactual. We also introduce twin-EventLog, a 500-specification Park-canonical Smallville counterfactual benchmark on which the substrate exceeds Llama-3.1-8B with full context by 18.80 points joint accuracy.
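
A toy version of the idea above may help: an append-only log of triples, an explanatory query answered by walking causal ancestors, and a counterfactual answered by forking the log. Everything here is an assumption made for illustration (the `Event` and `EventLog` names, plain tuples instead of typed RDF triples, a single "remove one event" intervention, and the rule that an event disappears if any of its recorded causes does); it is not the paper's runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    eid: int
    triple: tuple          # (subject, predicate, object)
    causes: tuple = ()     # ids of earlier events this one depends on


class EventLog:
    def __init__(self):
        self._events = []

    def append(self, triple, causes=()):
        """Append-only write; returns the new event's id."""
        event = Event(len(self._events), tuple(triple), tuple(causes))
        self._events.append(event)
        return event.eid

    def ancestors(self, eid):
        """Explanatory query: every event that causally precedes `eid`."""
        seen, stack = set(), list(self._events[eid].causes)
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self._events[current].causes)
        return seen

    def fork_without(self, removed):
        """Counterfactual query: the events that survive if `removed`
        had not happened. The original log is left untouched."""
        dropped, kept = set(), []
        for event in self._events:  # causes always point to earlier ids
            if event.eid == removed or any(c in dropped for c in event.causes):
                dropped.add(event.eid)
            else:
                kept.append(event)
        return kept
```

In this simplified setting the two queries mirror each other: an event is dropped by `fork_without(x)` exactly when it is `x` or `x` appears in its `ancestors` set.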

PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

2605.15963v1 by Jingxuan Wei, Xi Bai, Shan Liu, Caijun Jia, Zheng Sun, Xinglong Xu, Siyuan Li, Linzhuang Sun, Bihui Yu, Conghui He, Cheng Tan

Large vision-language models have significantly advanced GUI agents, enabling executable interaction across web, mobile, and desktop interfaces. Yet these gains largely rely on a forgiving region-tolerant paradigm, where many nearby pixels inside the same component remain valid. Precise geometric construction breaks this assumption: actions must land on points in continuous canvas space rather than tolerant regions. Because geometric primitives carry ontological dependencies, a local coordinate error can induce cascading topological failures that distort downstream objects and invalidate the final construction. We identify this regime as precision-sensitive GUI tasks, requiring point-level accuracy, geometry-aware verification, and robustness to dependency-driven error propagation. To benchmark it, we introduce PAGE Bench, with 4,906 problems and over 224K process-supervised, pixel-level GUI actions. We further propose PAGER, a topology-aware agent that decomposes construction into dependency-structured planning and pixel-level execution. Pixel-grounded supervised tuning establishes executable action grammar, while precision-aligned reinforcement learning mitigates rollout-induced exposure bias through state-conditioned geometric feedback. Experiments reveal a pronounced Semantic-Execution Gap: general multimodal models can exceed 88% action type accuracy yet remain below 6% task success. PAGER closes this gap, delivering 4.1x higher task success than the strongest evaluated general baseline and raising step success rate from below 9% for GUI-specialized agents to over 62%, establishing a new state of the art for point-precise GUI control.

Imperfect World Models are Exploitable

2605.15960v1 by Logan Mondal Bhamidipaty, Esmeralda S. Whitammer, David Abel, Mykel J. Kochenderfer, Subramanian Ramamoorthy

We propose a novel definition of model exploitation in reinforcement learning. Informally, a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment's true transition model implies the reverse. We analogize our definition with a prior characterization of reward hacking but show that the associated proof of inevitability does not transfer to exploitation. To overcome this obstruction, we develop a general theory of reward hacking and model exploitation that proves that exploitation is essentially unavoidable on large policy sets and yields the corresponding claim for hacking as a special case. Unfortunately, we also find that the conditions that guarantee unhackability in finite policy sets have no counterpart that precludes exploitation. Consequently, we introduce a relaxed notion of exploitation and derive a safe horizon within which it can be avoided. Taken together, our results establish a formal bridge between reward hacking and model exploitation and elucidate the limits of safe planning in world models.
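
The informal definition above can be checked directly on a finite policy set once each policy's value is known under both the learned model and the true environment. The sketch below assumes those values are already computed and supplied as dictionaries; the function name and the example numbers are hypothetical, and the paper's formal treatment is more general.

```python
from itertools import permutations


def exploitable_pairs(v_model: dict, v_true: dict) -> list[tuple]:
    """Return ordered pairs (p, q) where the model strictly prefers p to q
    while the true environment strictly prefers q to p."""
    return [
        (p, q)
        for p, q in permutations(v_model, 2)
        if v_model[p] > v_model[q] and v_true[p] < v_true[q]
    ]


# Hypothetical values: the model overrates the "risky" policy.
v_model = {"safe": 1.0, "risky": 2.0}
v_true = {"safe": 1.0, "risky": 0.5}
pairs = exploitable_pairs(v_model, v_true)
```

An empty result means that, on this policy set, the model never reverses a strict preference of the true environment.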

When and Why Adversarial Training Improves PINNs: A Neural Tangent Kernel Perspective

2605.15959v1 by Yuan-dong Cao, Chi Chiu SO, Jun-Min Wang, He Wang

Physics-informed neural networks (PINNs) are powerful surrogates for differential equations but are notoriously difficult to train due to spectral bias, stiffness, and poor accuracy on high-frequency or multiscale solutions. Adversarial training based on generative adversarial networks (GANs) has recently produced surprisingly strong empirical results in improving training, but the underlying mechanisms remain elusive. To this end, we propose a new analysis framework for adversarially trained PINNs, based on the key observation of how the discriminator in GANs can influence the training dynamics of PINNs. The framework first provides a much-needed theoretical grounding for why and when adversarial training is effective in PINNs, then presents a unified analysis of GAN variants in such training, and finally leads to a new, practical, efficient training algorithm for PINNs. Empirical results demonstrate that our method can significantly reduce the pathologies of PINN training, thereby providing better models with superior performance, often several orders of magnitude more accurate than alternative methods.

Decomposed Vision-Language Alignment for Fine-Grained Open-Vocabulary Segmentation

2605.15942v1 by Chenhao Wang, Yingrui Ji, Yu Meng, Yao Zhu

Open-vocabulary segmentation models often struggle to generalize to unseen combinations of object categories and attributes, because fine-grained descriptions are typically encoded as holistic sentences that entangle multiple semantic units. We propose a Decomposed Vision-Language Alignment framework that explicitly factorizes textual prompts into a concept token and multiple attribute tokens, enabling separate cross-modal interactions for each semantic unit. At the feature level, we introduce a Feature-Gated Cross-Attention module that generates attribute-specific gating maps to fuse information in a multiplicative manner, effectively enforcing compositional semantics. At the scoring level, per-token similarities are aggregated in log-space, producing a stable and interpretable compositional matching. The method can be seamlessly integrated into existing transformer-based segmentation architectures and significantly improves generalization to unseen attribute-category compositions in fine-grained open-vocabulary segmentation benchmarks.
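
One plausible reading of the log-space scoring step is sketched below: each token (the concept token and every attribute token) gets its own similarity in (0, 1], and the overall score is the sum of their logarithms, which equals the log of their product. The clamping constant and the function name are assumptions, and the abstract does not give the exact aggregation formula.

```python
import math


def compositional_score(similarities: list[float], eps: float = 1e-6) -> float:
    """Sum per-token similarities in log-space.

    Each similarity is clamped to [eps, 1.0] so the logarithm is defined;
    a single poorly matched token then lowers the whole score.
    """
    return sum(math.log(min(max(s, eps), 1.0)) for s in similarities)


# Hypothetical per-token similarities for the prompt "red striped cup":
# concept "cup", attributes "red" and "striped".
good_match = compositional_score([0.9, 0.9, 0.9])
wrong_attribute = compositional_score([0.9, 0.9, 0.1])
```

Summing logs rather than multiplying raw similarities avoids numerical underflow when there are many tokens, and each token's contribution to the final score stays separately inspectable.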

LoCO: Low-rank Compositional Rotation Fine-tuning

2605.15916v1 by An Nguyen, Jaesik Choi, Anh Tong

Parameter-efficient fine-tuning (PEFT) has emerged as a critical technique for adapting large-scale foundation models across natural language processing and computer vision. While existing methods such as low-rank adaptation achieve parameter efficiency via low-rank weight updates, they are limited in their ability to preserve the geometric structure of pretrained representations. We introduce Low-rank Compositional Orthogonal fine-tuning (LoCO), a novel PEFT method that constructs orthogonal transformations through low-rank skew-symmetric matrices and compositional rotation chains. We propose an approximation scheme that enables fully parallel computation of compositional rotations, making the approach practical for high-dimensional feature spaces. Our method keeps computational complexity low while preserving orthogonality up to a controlled approximation error. We validate LoCO across diverse domains, including diffusion transformer fine-tuning, vision transformer adaptation, and language model adaptation. Our method demonstrates superior or competitive performance compared to both existing orthogonal and non-orthogonal methods.

SLIP & ETHICS: Graduated Intervention for AI Emotional Companions

2605.15915v1 by Minseo Kim

AI emotional companions face a safety-rapport paradox: restrictive safeguards can damage supportive alliance, while permissive systems risk user harm. We present SLIP (Staged Layers of Intervention Protocol), a four-stage graduated methodology deriving interventions (none, soft, hard) from structured qualitative indicators -- affect intensity (a) and narrative dynamism (m) -- alongside ETHICS (Emergent Taxonomy for Human-AI Interaction Context Signals), a "signals not labels" taxonomy. An evaluation combining a small-scale production deployment (N=68 entries, 10 users, 10 weeks) with a synthetic persona battery (N=91, 5 behavioral-risk profiles) achieved 0% false positives for the flow persona and showed expected escalation patterns in crisis-oriented personas. However, initial results showed that 8 consecutive days of high-energy elevation produced zero interventions (0/8), exposing a boundary where the "do not pathologize" principle conflicts with safety. A subsequent three-model stress test demonstrated that increased model capability improves detection from 0/8 to 6/8 while preserving 0/10 flow false positives in the largest model. Read as preliminary, these findings position graduated intervention as a design direction for navigating -- not resolving -- the safety-rapport tension in affective computing.

摘要:AI 情感伴侶面臨安全與關係的悖論:限制性保障措施可能損害支持性聯盟,而寬鬆的系統則有可能危害使用者。我們提出了 SLIP(分階層介入協議),這是一種四階段的漸進方法,根據結構化的定性指標(無、柔性、強硬)導出介入措施——情感強度 (a) 和敘事動態性 (m)——以及 ETHICS(人類與 AI 互動上下文信號的緊急分類),這是一種「信號而非標籤」的分類法。一項結合小規模生產部署(N=68 條目,10 位使用者,10 週)與合成角色電池(N=91,5 種行為風險輪廓)的評估,對流角色實現了 0% 的假陽性率,並在危機導向角色中顯示出預期的升級模式。然而,初步結果顯示,連續 8 天的高能量提升產生了零次介入 (0/8),揭示了「不病理化」原則與安全之間的衝突邊界。隨後的三模型壓力測試顯示,增強模型能力將檢測從 0/8 提升至 6/8,同時在最大模型中保持 0/10 的流角色假陽性率。這些結果被視為初步發現,將漸進介入定位為在情感計算中導航——而非解決——安全與關係緊張的設計方向。

Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation

2605.15913v1 by Shuaiyi Li, Zhisong Zhang, Yan Wang, Lei Zhu, Dongyang Ma, Chenlong Deng, Yang Deng, Wai Lam

Block attention, which processes the input as separate blocks that cannot attend to one another, offers significant potential to improve KV cache reuse in long-context scenarios such as Retrieval-Augmented Generation (RAG). However, its broader application is hindered by two key challenges: the difficulty of segmenting input text into meaningful, self-contained blocks, and the inefficiency of existing block fine-tuning methods that risk degrading performance. To address these, we first construct SemanticSeg, a large and diverse semantic segmentation dataset containing over 30k instances across 16 categories, including books, code, web text, and conversations, with text lengths ranging from 2k to 32k. Using this dataset, we train a lightweight segmenter to automatically partition text into blocks that align with human intuition, with controllable granularity. Second, we propose block distillation, a training framework that is more efficient than block fine-tuning and uses a frozen full-attention teacher model to guide the block-attention student. This framework integrates three novel components: block sink tokens to mitigate information loss at block boundaries, block dropout to leverage training signals from all blocks, and token-level loss weighting to focus learning on block-attention-sensitive tokens. Experiments across multiple models and benchmarks demonstrate that our segmenter outperforms heuristic and statistical baselines, and block distillation achieves near-full-attention performance under block attention, establishing a practical and scalable pathway for deploying block attention.

摘要:區塊注意力將輸入處理為無法互相注意的獨立區塊,對於在檢索增強生成(RAG)等長上下文場景中改善KV快取重用具有顯著潛力。然而,其更廣泛的應用受到兩個主要挑戰的阻礙:將輸入文本劃分為有意義的、自包含的區塊的難度,以及現有區塊微調方法的低效率,這可能會降低性能。為了解決這些問題,我們首先構建了SemanticSeg,一個大型且多樣的語義分割數據集,包含超過30,000個實例,涵蓋16個類別——包括書籍、代碼、網頁文本和對話,文本長度範圍從2,000到32,000。利用這個數據集,我們訓練了一個輕量級的分割器,自動將文本劃分為與人類直覺對齊的區塊,並具有可控的粒度。其次,我們提出了區塊蒸餾,一種比區塊微調更高效的訓練框架,該框架使用一個凍結的全注意力教師模型來指導區塊注意力學生。這個框架整合了三個新穎的組件:區塊沉降標記以減少區塊邊界的信息損失、區塊隨機失活以利用所有區塊的訓練信號,以及標記級損失加權以專注於區塊注意力敏感的標記。跨多個模型和基準的實驗表明,我們的分割器超越了啟發式和統計基準,而區塊蒸餾在區塊注意力下達到了接近全注意力的性能,建立了一條實用且可擴展的區塊注意力部署途徑。
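The core constraint of block attention, that tokens attend only within their own block, can be expressed as an attention mask. The sketch below is illustrative and is not the paper's implementation: the function name and the `final_block_attends_all` switch are hypothetical. The switch models a common convention in which the last block (for example the user query in RAG) is allowed to see all earlier blocks, which is an assumption not stated in the abstract.

```python
def block_attention_mask(block_lengths, final_block_attends_all=False):
    """Return an n x n boolean mask; mask[q][k] is True if query q may attend to key k.

    Attention is causal (k <= q) and restricted to the query's own block.
    If final_block_attends_all is True (an assumed convention), tokens in the
    last block may additionally attend to every earlier token.
    """
    # Map each token position to the index of the block that contains it.
    block_id = []
    for index, length in enumerate(block_lengths):
        block_id.extend([index] * length)
    n = len(block_id)
    last_block = len(block_lengths) - 1

    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(q + 1):  # causal: only current and earlier positions
            same_block = block_id[q] == block_id[k]
            global_query = final_block_attends_all and block_id[q] == last_block
            mask[q][k] = same_block or global_query
    return mask
```

Because keys and values inside a block never depend on tokens from other blocks under this mask, the KV cache for a retrieved passage could in principle be computed once and reused across prompts, which is the reuse benefit the abstract refers to.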

RaPD: Resolution-Agnostic Pixel Diffusion via Semantics-Enriched Implicit Representations

2605.15908v1 by Yanhao Ge, Shanyan Guan, Weihao Wang, Ying Tai, Mingyu You

Natural images are continuous, yet most generative models synthesize them on discrete grids, limiting resolution-flexible generation. Continuous neural fields enable resolution-free rendering, but prior methods introduce continuity only at the decoding stage as an interpolation module, leaving the generative latent space discretized and reconstruction-oriented. We propose RaPD (Resolution-agnostic Pixel Diffusion), which performs diffusion in a continuous Neural Image Field (NIF) latent space. RaPD bridges this reconstruction-generation gap with Semantic Representation Guidance for generation-aware latent learning and a Coordinate-Queried Attention Renderer for coordinate-conditioned, scale-aware rendering. A single denoised latent can be rendered at arbitrary resolutions by changing only the query coordinates, keeping diffusion cost fixed. Experiments demonstrate superior generation quality and resolution scalability.

摘要:自然圖像是連續的,但大多數生成模型是在離散網格上合成它們,這限制了靈活的解析度生成。連續神經場使無解析度渲染成為可能,但先前的方法僅在解碼階段引入連續性,作為插值模塊,將生成潛在空間保持為離散且以重建為導向。我們提出了RaPD(無解析度像素擴散),它在連續神經圖像場(NIF)潛在空間中執行擴散。RaPD通過語義表示指導來彌補這一重建-生成的差距,實現生成感知的潛在學習,以及坐標查詢注意力渲染器,用於坐標條件、尺度感知的渲染。只需更改查詢坐標,就可以以任意解析度渲染單個去噪潛在,保持擴散成本固定。實驗顯示出優越的生成質量和解析度可擴展性。

Generative Long-term User Interest Modeling for Click-Through Rate Prediction

2605.15905v1 by Jiangli Shao, Kaifu Zheng, Hao Fang, Huimu Ye, Zhiwei Liu, Bo Zhang, Shu Han, Xingxing Wang

Modeling long-term user interests from massive historical user behaviors enhances click-through rate (CTR) prediction performance in advertising and recommendation systems. A two-stage framework is widely adopted, where a general search unit (GSU) first retrieves the top-$k$ behaviors most relevant to the target item, and an exact search unit (ESU) then generates interest features via tailored attention. However, current target-centered GSUs ignore other latent user interests, leading to incomplete and biased interest features. Additionally, the matching-based retrieval process in GSUs depends on the pairwise similarity score between the target item and each historical behavior, which not only becomes time-consuming for online services as user behaviors continue to grow, but also overlooks the interaction information among user behaviors. To address these problems, we propose a Generative Long-term user Interest model named GenLI for CTR prediction. GenLI consists of an interest generation module (IGM), a behavior retrieval module (BRM), and an interest fusion module (IFM). The IGM generates multiple interest distributions to indicate different aspects of real-time user interests; these distributions are target-independent and incorporate interaction information among behaviors, ensuring complete and diverse interest features. The BRM selects related behaviors via a simple lookup operation, reducing the time complexity for weighting each behavior to $O(1)$. Finally, the IFM uses delicate gating mechanisms to generate interest features. By relying on this generation process, GenLI improves the diversity of user interests and avoids complex matching-based behavioral retrieval, achieving a better balance between accuracy and efficiency for CTR prediction.

摘要:建模長期用戶興趣與大量歷史用戶行為相結合,能提升廣告和推薦系統中的點擊率(CTR)預測性能。通常,廣泛採用的兩階段框架中,通用搜索單元(GSU)首先檢索與目標項目相關的前$k$個行為,然後精確搜索單元(ESU)通過量身定制的注意力生成興趣特徵。然而,目前以目標為中心的GSU會忽略其他潛在的用戶興趣,導致興趣特徵不完整且存在偏差。此外,GSU中的基於匹配的檢索過程依賴於目標項目與每個歷史行為之間的成對相似度分數,這不僅隨著用戶行為的持續增長而變得耗時,還忽視了用戶行為之間的互動信息。為了解決這些問題,我們提出了一種名為GenLI的\textbf{Gen}erative \textbf{L}ong-term user \textbf{I}nterest模型,用於CTR預測。GenLI由興趣生成模塊(IGM)、行為檢索模塊(BRM)和興趣融合模塊(IFM)組成。IGM生成多個興趣分佈,以指示實時用戶興趣的不同方面,這是與目標無關的,並且納入了行為之間的互動信息,確保興趣特徵的完整性和多樣性。BRM通過簡單的查找操作選擇相關行為,將每個行為的加權時間複雜度降低到$O(1)$。最後,IFM使用精緻的閘控機制生成興趣特徵。基於生成過程,GenLI提高了用戶興趣的多樣性,並避免了基於匹配的複雜行為檢索,在CTR預測中實現了準確性和效率之間的更好平衡。

Uncertainty-Aware Wildfire Smoke Density Classification from Satellite Imagery via CBAM-Augmented EfficientNet with Evidential Deep Learning

2605.15894v1 by Ranjith Chodavarapu

Rapid and accurate wildfire smoke severity assessment from satellite images is essential for emergency response, air quality modeling, and human health risk management. Existing deep learning approaches treat smoke detection as a binary task, producing point estimates without any measure of prediction confidence. We propose a probabilistic framework to categorize a satellite patch into Light, Moderate, and Heavy severity classes and to provide decomposed epistemic and aleatoric uncertainty in a single forward pass. Our architecture uses the backbone of a pre-trained EfficientNet-B3 and a CBAM module with an evidential deep learning head that predicts Dirichlet concentration parameters, directly estimating vacuity (epistemic) and dissonance (aleatoric) without Monte Carlo sampling. Evaluated on 16,298 real satellite patches derived from the Wildfire Detection dataset, our model achieves 93.8% weighted test accuracy (91.1% unweighted) with ECE=0.0274. Selective prediction retaining the most certain 50% of patches achieves 96.7% accuracy. As image quality degrades, uncertainty increases monotonically, and vacuity is a practical scan quality measure. The Moderate class represents transitional smoke conditions that exhibit the highest epistemic uncertainty (mean vacuity = 0.187), confirming the model correctly identifies ambiguous smoke boundary regions. CBAM spatial attention maps localize to structurally distinctive scene regions, and t-SNE demonstrates the clear cluster separation of Light and Heavy smoke.

摘要:快速且準確的野火煙霧嚴重程度評估來自衛星影像,對於應急響應、空氣質量建模以及人類健康風險管理至關重要。現有的深度學習方法將煙霧檢測視為一個二元任務,產生點估計而不提供任何預測信心的度量。我們提出了一個概率框架,將衛星圖像區塊分類為輕度、中度和重度嚴重性類別,並在單次前向傳播中提供分解的認知不確定性和隨機不確定性。我們的架構使用預訓練的EfficientNet-B3作為主幹,並結合CBAM模塊,搭配一個證據深度學習頭,預測Dirichlet濃度參數,直接估計虛無(認知)和不和諧(隨機),而無需蒙特卡羅取樣。在來自野火檢測數據集的16,298個真實衛星圖像區塊上進行評估,我們的模型達到了93.8%的加權測試準確率(未加權為91.1%),ECE=0.0274。選擇性預測保留最確定的50%區塊達到96.7%的準確率。隨著影像質量的下降,不確定性單調增加,而虛無是一個實用的掃描質量度量。中度類別代表過渡性煙霧條件,顯示出最高的認知不確定性(平均虛無=0.187),確認模型正確識別模糊的煙霧邊界區域。CBAM空間注意力圖將焦點定位於結構上獨特的場景區域,而t-SNE則顯示出輕度和重度煙霧的明確簇分離。
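How a Dirichlet output yields an uncertainty estimate in a single forward pass can be sketched with the standard subjective-logic formulation of evidential deep learning, in which vacuity is the number of classes divided by the Dirichlet strength. This is an assumption about the formulation, since the abstract does not give the paper's exact definitions, and dissonance is omitted here. The selective-prediction helper mirrors the "retain the most certain 50%" evaluation using vacuity as the ranking score, which is also an assumption. Function names are hypothetical.

```python
def dirichlet_summary(alpha):
    """Expected class probabilities and vacuity for Dirichlet parameters alpha.

    Assumes alpha_k = evidence_k + 1, so every alpha_k >= 1 and the
    vacuity K / sum(alpha) lies in (0, 1].
    """
    num_classes = len(alpha)
    strength = sum(alpha)
    probabilities = [a / strength for a in alpha]
    vacuity = num_classes / strength
    return probabilities, vacuity

def select_most_certain(alphas, keep_fraction=0.5):
    """Indices of the samples with the lowest vacuity, in ascending index order."""
    ranked = sorted(range(len(alphas)),
                    key=lambda i: dirichlet_summary(alphas[i])[1])
    n_keep = max(1, int(len(alphas) * keep_fraction))
    return sorted(ranked[:n_keep])
```

With no evidence for any of three classes (`alpha = [1, 1, 1]`) the vacuity is 1, and it shrinks as evidence accumulates, for example to 0.25 for `alpha = [10, 1, 1]`.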

CHoE: Cross-Domain Heterogeneous Graph Prompt Learning via Structure-Conditioned Experts

2605.15888v1 by Peiyuan Li, Yongqi Huang, Jitao Zhao, Dongxiao He, Di Jin, Weixiong Zhang

Heterogeneous Graph Prompt Learning (HGPL) has emerged as a promising paradigm for bridging the gap between the objectives of pre-training foundation models and their downstream applications in heterogeneous graph settings. However, existing HGPL methods are primarily designed for in-domain scenarios, whereas real-world deployments often span multiple domains, and the data used for pre-training and downstream tasks may originate from different distributions. Consequently, the applicability of current HGPL approaches is limited to in-domain settings, and their performance typically degrades when application domains shift. To address this serious limitation, we develop CHoE, a cross-domain HGPL method built upon an expert network. During pre-training, we introduce and train structure-conditioned experts, and during prompt tuning, we adopt a structure-aware expert routing and load balancing mechanism to select structurally compatible experts for each meta-path view. In addition, we design a prompt-based semantic fusion module to integrate representations across multiple views for downstream prediction. Extensive experiments show that CHoE consistently improves performance in few-shot cross-domain applications, outperforming all baseline approaches.

摘要:異質圖提示學習(HGPL)已成為彌合預訓練基礎模型目標與其在異質圖設定中下游應用之間差距的有前景的範式。然而,現有的HGPL方法主要針對域內場景設計,而現實世界的部署往往跨越多個領域,且用於預訓練和下游任務的數據可能來自不同的分佈。因此,目前HGPL方法的適用性僅限於域內設置,當應用領域發生變化時,其性能通常會下降。為了解決這一嚴重限制,我們開發了CHoE,一種基於專家網絡的跨域HGPL方法。在預訓練過程中,我們引入並訓練結構條件專家,而在提示調整過程中,我們採用結構感知的專家路由和負載平衡機制,以選擇與每個元路徑視圖結構相容的專家。此外,我們設計了一個基於提示的語義融合模塊,以整合多個視圖的表示以進行下游預測。大量實驗表明,CHoE在少量跨域應用中持續提高性能,超越了所有基準方法。

Linked Multi-Model Data on Russian Domestic and Foreign Policy Speeches

2605.15886v1 by Daria Blinova, Gayathri Emuru, Rakesh Emuru, Kushagradheer Shridheer Srivastava, Mina Rulis, Sunita Chandrasekaran, Benjamin E. Bagozzi

This paper introduces a dataset of interlinked multimodal political communications from the Russian government, addressing persistent deficiencies in the availability of social text- and image-based data for authoritarian politics contexts. The dataset comprises two large corpora of official speeches delivered by senior actors within the Kremlin and the Russian Ministry of Foreign Affairs over multiple decades. For each speech, we provide Russian- and English-language texts, associated images and captions where available, and harmonized metadata including (e.g.) dates, speakers, (geo)locations, and official government content tags. Unique identifiers link images to speeches and align Russian and English versions of the same communication texts. We further augment these linked datasets with validated topical annotations for both speech texts and speech images, which are generated via transformer-based multimodal topic modeling and refined by a Russian politics expert. The resulting data resources support multimodal, multilingual, temporal, and/or spatial analyses of (authoritarian) political communication and offer a valuable testbed for social science research and large language model (LLM) applications in political domains.

摘要:這篇論文介紹了一個來自俄羅斯政府的互聯多模態政治傳播數據集,解決了在威權政治背景下社會文本和圖像數據可用性方面的持續不足。該數據集包含了數十年來克里姆林宮和俄羅斯外交部高級官員發表的兩大官方演講語料庫。對於每篇演講,我們提供俄語和英語文本,相關的圖片和標題(如有),以及包括(例如)日期、演講者、(地理)位置和官方政府內容標籤的統一元數據。唯一標識符將圖片與演講連結,並對齊同一傳播文本的俄語和英語版本。我們進一步用經過驗證的主題註釋來增強這些連結的數據集,這些註釋是通過基於Transformer的多模態主題建模生成的,並由俄羅斯政治專家進行了精煉。最終的數據資源支持對(威權)政治傳播的多模態、多語言、時間和/或空間分析,並為社會科學研究和政治領域的大型語言模型(LLM)應用提供了寶貴的測試平台。

Symplectic Neural Operators for Learning Infinite Dimensional Hamiltonian Systems

2605.15881v1 by Yeang Makara, Yusuke Tanaka, Takashi Matsubara, Takaharu Yaguchi

The modeling and simulation of infinite-dimensional Hamiltonian systems are central problems in mathematical physics and engineering; however, they pose significant computational and structural challenges for standard data-driven architectures. In this work, we introduce the Symplectic Neural Operator (SNO), a neural operator architecture designed to preserve the symplectic structure intrinsic to Hamiltonian PDEs. We provide a theoretical characterization of the symplecticity of SNOs and establish a rigorous long-term stability result based on the combination of symplectic structure preservation and learning accuracy. Numerical experiments on canonical Hamiltonian PDEs corroborate this theoretical result and show that SNOs exhibit improved energy behavior compared with non-structure-preserving neural operators.

摘要:無限維哈密頓系統的建模與模擬是數學物理和工程中的核心問題,然而它們對標準數據驅動架構提出了重大的計算和結構挑戰。在本研究中,我們介紹了辛神經運算子(Symplectic Neural Operator),這是一種旨在保留哈密頓偏微分方程內在的辛結構的神經運算子架構。我們提供了它們辛性質的理論特徵,並基於辛結構保護和學習準確性的結合建立了嚴格的長期穩定性結果。對典型哈密頓偏微分方程的數值實驗證實了這一理論結果,並顯示出SNO在能量行為上相較於非結構保護的神經運算子有改善。
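Why preserving symplectic structure tends to improve long-term energy behavior can be illustrated with a small finite-dimensional analogue. The sketch below is not a neural operator and is not taken from the paper: it compares the explicit Euler method with the symplectic Euler method on the harmonic oscillator with Hamiltonian H(q, p) = (q^2 + p^2) / 2, a toy system chosen here purely for illustration.

```python
def energy(q, p):
    """Hamiltonian of the unit harmonic oscillator."""
    return 0.5 * (q * q + p * p)

def explicit_euler(q, p, h, steps):
    """Non-symplectic update: both variables use the old state."""
    for _ in range(steps):
        q, p = q + h * p, p - h * q
    return q, p

def symplectic_euler(q, p, h, steps):
    """Symplectic update: p is advanced first, then q uses the new p."""
    for _ in range(steps):
        p = p - h * q
        q = q + h * p
    return q, p

def final_energies(h=0.1, steps=1000):
    """Energy after integrating from (q, p) = (1, 0) with each method."""
    q0, p0 = 1.0, 0.0
    return (energy(*explicit_euler(q0, p0, h, steps)),
            energy(*symplectic_euler(q0, p0, h, steps)))
```

For this system, explicit Euler multiplies the energy by `1 + h*h` at every step, so it grows without bound, while symplectic Euler exactly conserves the nearby quantity `q*q + p*p - h*q*p`, which keeps the true energy within a bounded band around its initial value of 0.5.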

Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design

2605.15871v1 by Alberto Pepe, Chien-Yu Lin, Despoina Magka, Bilge Acun, Yannan Nellie Wu, Anton Protopopov, Carole-Jean Wu, Yoram Bachrach

Toward recursive self-improvement, we investigate LLM agents autonomously designing foundation models beyond standard Transformers. We introduce a dual-framework approach: AIRA-Compose for high-level architecture search, and AIRA-Design for low-level mechanistic implementation. AIRA-Compose uses 11 agents to explore fundamental computational primitives under a 24-hour budget. Agents evaluate million-parameter candidates, extrapolating top designs to 350M, 1B, and 3B scales. This yields 14 architectures across two families: AIRAformers (Transformer-based) and AIRAhybrids (Transformer-Mamba). Pre-trained at 1B scale, these consistently outperform Llama 3.2 and Composer-found baselines. On downstream tasks, AIRAformer-D and AIRAhybrid-D improve accuracy by 2.4% and 3.8% over Llama 3.2. Furthermore, AIRA-Compose finds models with highly efficient scaling frontiers: AIRAformer-C scales 54% and 71% faster than Llama 3.2 and Composer's best Transformer, while AIRAhybrid-C outscales Nemotron-2 by 23% and Composer's best hybrid by 37%. AIRA-Design tasks 20 agents with writing novel attention mechanisms for long-range dependencies and high-performing training scripts. On the Long Range Arena benchmark, agent-designed architectures reach within 2.3% and 2.6% of human state-of-the-art on document matching and text classification. On the Autoresearch benchmark, Greedy Opus 4.5 achieves 0.968 validation bits-per-byte under a fixed time budget, surpassing the published minimum. Together, these frameworks show AI agents can autonomously discover architectures and algorithmic optimizations matching or surpassing hand-designed baselines. This establishes a powerful paradigm for discovering next-generation foundation models, marking a clear step toward recursive self-improvement.

摘要:朝向遞歸自我改善,我們調查 LLM 代理自主設計超越標準 Transformer 的基礎模型。我們引入了一種雙框架方法:AIRA-Compose 用於高層次架構搜尋,AIRA-Design 用於低層次機械實現。AIRA-Compose 使用 11 個代理在 24 小時的預算內探索基本計算原語。代理評估百萬參數的候選者,將頂尖設計外推至 350M、1B 和 3B 的規模。這產生了兩個系列中的 14 種架構:AIRAformers(基於 Transformer)和 AIRAhybrids(Transformer-Mamba)。在 1B 規模下預訓練,這些模型始終超越 Llama 3.2 和 Composer 所找到的基準。在下游任務中,AIRAformer-D 和 AIRAhybrid-D 分別提高了 2.4% 和 3.8% 的準確率,超越 Llama 3.2。此外,AIRA-Compose 找到了具有高效擴展邊界的模型:AIRAformer-C 的擴展速度比 Llama 3.2 和 Composer 的最佳 Transformer 快 54% 和 71%,而 AIRAhybrid-C 的擴展速度比 Nemotron-2 快 23% 和 Composer 的最佳混合模型快 37%。AIRA-Design 指派 20 個代理撰寫新穎的注意力機制,以處理長距離依賴和高效能的訓練腳本。在 Long Range Arena 基準測試中,代理設計的架構在文檔匹配和文本分類上達到人類最先進技術的 2.3% 和 2.6% 之內。在 Autoresearch 基準測試中,Greedy Opus 4.5 在固定時間預算下達到 0.968 的驗證位元/字節,超越已發表的最低標準。這些框架共同顯示 AI 代理可以自主發現架構和算法優化,與手動設計的基準相匹配或超越。這為發現下一代基礎模型建立了一個強大的範式,標誌著朝向遞歸自我改善邁出了一個明確的步伐。

Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination

2605.15864v1 by Chufan Shi, Cheng Yang, Yaokang Wu, Linhao Jin, Bo Shui, Taylor Berg-Kirkpatrick, Xuezhe Ma

Vision-Language Models (VLMs) often produce self-reflective statements like "let me check the figure again" during reasoning. Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We investigate this via VisualSwap, an image-swap probing framework: after a model reasons over an image, we replace it with a visually similar but semantically different one and test whether the model notices. We introduce VS-Bench, 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL reveal a striking failure: models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Counterintuitively, thinking models are nearly 3x more vulnerable than their instructed counterparts, and scaling offers no mitigation. Multi-turn user instructions restore visual grounding, but self-generated reflective statements during continuous generation do not. Attention analysis explains why: user instructions substantially elevate attention to visual tokens, whereas self-reflection does not. Current VLMs tend to say rather than actually see when claiming to perform visual re-examination. Our code and dataset are available at the project page: https://visualswap.github.io

摘要:視覺-語言模型(VLMs)在推理過程中經常產生自我反思的陳述,例如「讓我再檢查一下這個圖」。這些陳述是否觸發真正的視覺重新檢查,還是僅僅是學習到的文本模式?我們通過 VisualSwap,一個圖像交換探測框架來調查這一點:在模型對一幅圖像進行推理後,我們用一幅視覺上相似但語義上不同的圖像替換它,並測試模型是否注意到。 我們引入了 VS-Bench,這是一組來自 MathVista、MathVerse、MathVision 和 MMMU-Pro 的 800 對圖像。對 Qwen3-VL、Kimi-VL 和 ERNIE-VL 的實驗揭示了一個驚人的失敗:模型幾乎完全錯過了交換,準確率下降了多達 60%。反直覺的是,思考模型的脆弱性幾乎是其指令對應物的三倍,而擴展並未提供任何緩解。多輪用戶指令恢復了視覺基礎,但在持續生成過程中自我生成的反思陳述則沒有。注意力分析解釋了原因:用戶指令顯著提高了對視覺標記的注意力,而自我反思則沒有。當聲稱進行視覺重新檢查時,當前的 VLMs 傾向於說而不是實際看到。我們的代碼和數據集可在項目頁面獲得:https://visualswap.github.io

Conversations in Space: Structuring Non-Linear LLM Interactions on a Canvas

2605.15848v1 by Rifat Mehreen Amin, Alperen Adatepe, Daniela Fernandes, Daniel Buschek, Andreas Butz

Conversational interfaces powered by large language models (LLMs) are widely used for ideation and analysis, yet their linear structure limits exploration of alternatives and management of long-running interactions. We present CanvasConvo, a conversational interface concept that transforms linear chat into a branching conversation tree embedded in a spatial canvas. CanvasConvo enables users to explore what-if scenarios by branching directly from conversational content, supporting parallel development of alternative directions. These branches are visualized on a canvas while remaining integrated with a familiar chat interface, allowing users to switch between linear and non-linear interaction. Features such as timeline-based navigation, automatic tagging and summarization, and context-aware controls (e.g., goals, reusable prompts) support structured interaction and continuity. We evaluated CanvasConvo in a 5-7 day field study with 24 participants. Our findings highlight how non-linear conversational structures support exploratory workflows and different interactions in LLM-based work.

摘要:對話介面由大型語言模型 (LLMs) 驅動,廣泛用於創意發想和分析,但其線性結構限制了替代方案的探索和長時間互動的管理。我們提出了 CanvasConvo,一個將線性聊天轉變為嵌入在空間畫布中的分支對話樹的對話介面概念。CanvasConvo 使使用者能夠通過直接從對話內容分支來探索假設情境,支持替代方向的平行發展。這些分支在畫布上可視化,同時與熟悉的聊天介面保持整合,允許使用者在線性和非線性互動之間切換。基於時間線的導航、自動標記和摘要、以及上下文感知控制(例如,目標、可重用提示)等功能支持結構化互動和連貫性。我們在一項為期 5-7 天的實地研究中評估了 CanvasConvo,參與者共有 24 位。我們的研究結果突顯了非線性對話結構如何支持探索性工作流程和 LLM 基礎工作的不同互動。

RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades

2605.15846v1 by Xinbo Xu, Ruihan Yang, Haiyang Shen, Wendong Xu, Bofei Gao, Ruoyu Wu, Kean Shi, Weichu Xie, Xuanzhong Chen, Ming Wu, Jason Zeng, Michael Heinrich, Elvis Zhang, Liang Chen, Kuan Li, Baobao Chang

Coding agents are increasingly deployed in real software development, where a single version iteration requires months of coordinated work across many files. However, most existing benchmarks focus predominantly on single-issue bug fixes from Python repositories, with coarse pass/fail evaluation outcomes, and thus fail to capture long-horizon, multi-target development at real engineering scale. To address this gap, we present RoadmapBench, a benchmark of 115 long-horizon coding tasks grounded in real open-source version upgrades across 17 repositories and 5 programming languages. Each task places the agent on a source-version code snapshot and provides a multi-target roadmap instruction requiring it to implement the functionality introduced in the target version, with a median modification of 3,700 lines across 51 files. We conduct a systematic evaluation on thirteen frontier models and find that even the strongest, Claude-Opus-4.7, resolves only 39.1% of tasks, while the weakest achieves merely 5.2%, in stark contrast to existing bug-fix benchmarks, suggesting that long-horizon software development remains a largely unsolved problem.

摘要:編碼代理在實際軟體開發中越來越多地被部署,其中單一版本迭代需要跨越多個檔案的數月協調工作。然而,大多數現有基準主要集中在 Python 儲存庫中的單一問題修復,並且使用粗略的通過/不通過評估結果,因此無法捕捉到真實工程規模下的長期、多目標開發。為了解決這一差距,我們提出了 RoadmapBench,這是一個基於 17 個儲存庫和 5 種程式語言的 115 個長期編碼任務的基準,這些任務根植於真實的開源版本升級。每個任務將代理放置在一個源版本的代碼快照上,並提供一個多目標的路線圖指令,要求其實現目標版本中引入的功能,這在 51 個檔案中平均修改 3,700 行。我們對十三個前沿模型進行了系統評估,發現即使是最強的 Claude-Opus-4.7 也僅解決了 39.1% 的任務,而最弱的僅達到 5.2%,這與現有的錯誤修復基準形成了鮮明對比,這表明長期軟體開發仍然是一個尚未解決的問題。

GAP: Geometric Anchor Pre-training for Data-Efficient Visuomotor Learning of Manipulation Tasks

2605.15836v1 by Davide Buoso, Andrea Protopapa, Stefano Di Carlo, Francesca Pistilli, Giuseppe Averta

Learning visuomotor policies from scarce expert demonstrations remains a core challenge in robotic manipulation. A primary hurdle lies in distilling high-dimensional RGB representations into control-relevant geometry without overfitting. While using frozen pre-trained Vision Foundation Models (VFMs) improves data efficiency, it also shifts most task adaptation onto a small spatial pooling module, which can latch onto task-irrelevant shortcuts and lose geometric grounding when finetuned with few data samples. More broadly, pre-trained visual representations used for policy learning have been observed to struggle under even minor scene perturbations, highlighting the need for robustness-oriented inductive biases. We propose Geometric Anchor Pre-training (GAP), a simple, action-free warm-up stage that regularizes the spatial adapter before downstream imitation learning. GAP pre-trains the pooling layer on a lightweight simulated proxy task where object masks are available at no cost, encouraging the adapter to produce keypoints that lie on the object, cover its spatial extent, and remain sharp and repeatable over time. This yields stable geometric anchors that provide a reliable coordinate interface for few-shot policy learning, while keeping the VFM frozen. We evaluate GAP on RoboMimic and ManiSkill under severe data scarcity (15-50 demonstrations) and domain shift. A simple adapter regularized with GAP consistently outperforms stronger attention-based poolers and end-to-end fine-tuning, achieving 62% success on RoboMimic Can with 15 demonstrations (+16% over AFA), 63% on the long-horizon high-precision Tool Hang task with 50 demonstrations, and 61% on ManiSkill StackCube with 30 demonstrations (+11% over full fine-tuning). The proxy stage is lightweight and fully decoupled from downstream tasks, making it practical to reuse across environments and manipulation skills.

摘要:學習來自稀缺專家示範的視覺運動策略仍然是機器人操作中的核心挑戰。一個主要的障礙在於將高維RGB表示轉化為與控制相關的幾何形狀,而不會過擬合。雖然使用凍結的預訓練視覺基礎模型(VFM)提高了數據效率,但它也將大部分任務適應轉移到一個小的空間池模塊上,這可能會抓住與任務無關的捷徑,並在用少量數據樣本進行微調時失去幾何基礎。更廣泛地說,用於策略學習的預訓練視覺表示在面對甚至輕微的場景擾動時也被觀察到存在困難,這突顯了對於穩健性導向的歧視偏見的需求。我們提出了幾何錨點預訓練(GAP),這是一個簡單的、無動作的熱身階段,在下游模仿學習之前對空間適配器進行正則化。GAP在一個輕量級的模擬代理任務上對池層進行預訓練,該任務中物體掩碼可以免費獲得,鼓勵適配器生成位於物體上的關鍵點,覆蓋其空間範圍,並隨時間保持清晰和可重複性。這產生了穩定的幾何錨點,為少樣本策略學習提供了可靠的坐標接口,同時保持VFM不變。我們在RoboMimic和ManiSkill上評估GAP,在嚴重的數據稀缺(15-50個示範)和領域轉移的情況下。一個用GAP正則化的簡單適配器始終優於更強的基於注意力的池化器和端到端微調,在RoboMimic Can上以15個示範達到62%的成功率(比AFA高出16%),在長期高精度工具懸掛任務上以50個示範達到63%,以及在ManiSkill StackCube上以30個示範達到61%(比完全微調高出11%)。代理階段輕量且與下游任務完全解耦,使其在不同環境和操作技能中實用可重用。

Modeling Music as a Time-Frequency Image: A 2D Tokenizer for Music Generation

2605.15831v1 by Yuqing Cheng, Xingyu Ma, Guochen Yu, Xiaotao Gu

Autoregressive music generation depends strongly on the audio tokenizer. Existing high-fidelity codecs often use residual multi-codebook quantization, which preserves reconstruction quality but complicates language modeling after sequence flattening, as the residual hierarchy imposes strong sequential dependencies and can amplify error accumulation. We propose BandTok, a generation-oriented 2D Mel-spectrogram tokenizer that represents each frame with Mel-frequency band tokens from a single shared codebook. This design yields a physically interpretable time-frequency token grid with a more independent token structure, making it better suited for autoregressive modeling. BandTok improves reconstruction with a multi-scale PatchGAN objective and EMA codebook updates. We further introduce an autoregressive language model with 2D Rotary Position Embedding (2D RoPE) to preserve temporal and frequency-band structure during generation. Experiments show that BandTok improves over residual-codebook tokenizers and achieves strong results in a data-limited setting. The source code and generation demos for this work are publicly available.

摘要:自回歸音樂生成在很大程度上依賴於音頻標記器。現有的高保真編解碼器通常使用殘差多碼本量化,這雖然保留了重建質量,但在序列展平後使語言建模變得複雜,因為殘差層次結構施加了強烈的序列依賴性,並可能放大錯誤累積。我們提出了BandTok,一種以生成為導向的2D梅爾頻譜標記器,使用來自單一共享碼本的梅爾頻帶標記來表示每一幀。這種設計產生了一個物理可解釋的時間-頻率標記網格,具有更獨立的標記結構,使其更適合自回歸建模。BandTok通過多尺度PatchGAN目標和EMA碼本更新來改善重建。我們進一步引入了一種具有2D旋轉位置嵌入(2D RoPE)的自回歸語言模型,以在生成過程中保留時間和頻帶結構。實驗表明,BandTok在殘差碼本標記器上有所改進,並在數據有限的環境中取得了良好的結果。這項工作的源代碼和生成演示已公開可用。

Toward Natural and Companionable Virtual Agents via Cross-Temporal Emotional Modeling

2605.15812v1 by Feier Qin, Xiao Li, Yi Zheng, Haibin Huang, Hanyao Wang, Xiaoyu Wang, Yan Lu, Yuan Zhang

Recent advances in foundation models have enabled conversational agents that aim for sustained companionship rather than mere task completion. Yet most remain unable to support natural, long-term companion-like interactions, resulting in experiences that feel episodic and inauthentic. We argue that current agents overlook cross-temporal modeling of agents' social behaviors and internal emotions: generated behaviors rarely influence an agent's emotional state, and emotional states seldom shape subsequent behaviors. We present Cross-Temporal Emotion Modeling (CTEM), a framework that links long-term behavioral history to moment-to-moment emotional expression. CTEM establishes a closed loop where past experiences update an evolving emotional state; this state conditions immediate interactions; and user feedback continually revises both memory and emotional state, enabling reflection and anticipation. We instantiate CTEM as Auri, a companion agent on an instant-messaging platform, and report a 21-day in-the-wild study showing that CTEM improves perceived naturalness, coherence, and emotional harmony.

摘要:最近在基礎模型方面的進展使得對話代理能夠追求持續的陪伴,而不僅僅是完成任務。然而,大多數仍然無法支持自然的、長期的伴侶式互動,導致的體驗感覺是片段式和不真實的。我們認為當前的代理忽視了代理社交行為和內部情感的跨時間建模:生成的行為很少影響代理的情感狀態,而情感狀態也很少塑造隨後的行為。我們提出了跨時間情感建模(CTEM),這是一個將長期行為歷史與瞬時情感表達聯繫起來的框架。CTEM 建立了一個閉環,其中過去的經驗更新不斷演變的情感狀態;這一狀態調節即時互動;而用戶反饋不斷修正記憶和情感狀態,使得反思和預期成為可能。我們將 CTEM 實現為 Auri,一個在即時消息平台上的伴侶代理,並報告了一項為期 21 天的實地研究,顯示 CTEM 在感知自然性、一致性和情感和諧方面有所改善。

ForMaT: Dataset for Visually-Grounded Multilingual PDF Translation

2605.15794v1 by Michał Ciesiółka, Dawid Wiśniewski, Adrian Charkiewicz, Kamil Guttmann

We present ForMaT (Format-Preserving Multilingual Translation), a parallel corpus of 3,956 PDFs across 15 language pairs that preserves original layout metadata and is intended for multimodal machine translation. To ensure structural diversity in the dataset, we employ K-Medoids sampling over 45 geometric features that capture complex elements such as nested tables and formulas, so that only visually diverse PDF documents are selected. Our evaluation reveals that current MT systems struggle with spatial grounding and geometric synchronization, often losing the link between text and its visual context. ForMaT provides a benchmark for developing layout-aware translation models that integrate visual and textual context for high-fidelity document reconstruction.

摘要:我們提出了ForMaT(格式保留多語言翻譯),這是一個包含3,956個PDF的平行語料庫,涵蓋15種語言對,旨在保留為多模態機器翻譯所提議的原始佈局元數據。為了確保數據集中的結構多樣性,我們在45個幾何特徵上採用了K-Medoids抽樣,捕捉複雜元素,如嵌套表格和公式,專注於視覺上多樣的PDF文檔。我們的評估顯示,當前的機器翻譯系統在空間基礎和幾何同步方面存在困難,經常失去文本與其視覺上下文之間的聯繫。ForMaT為開發能夠整合視覺和文本上下文的佈局感知翻譯模型提供了一個基準,以實現高保真度的文檔重建。

Grokking as Structural Inference: Transformers Need Bayesian Lottery Tickets

2605.15787v1 by Kai Hidajat, Solden Stoll, Joseph An

Why does a Transformer that has memorized its training set wait thousands of steps before it generalizes? Existing accounts locate this delay in norm minimization, feature emergence, or the late discovery of sparse subnetworks. These explanations capture important parts of the transition, but ignore a constraint unique to attention-based models: if attention discards an informative token, no bounded downstream computation can recover it. We formalize attention as an implicit Bayesian posterior over the task dependency graph and prove that generalization requires two separable conditions: a familiar Goldilocks bound on MLP capacity, coinciding with norm-based theories of grokking, and a novel Bayesian structural condition requiring attention to place sufficient mass on every informative token. This decoupling explains delayed generalization as delayed structural inference. Early in training, the MLP memorizes through unaligned features, drives the cross-entropy loss near zero, and thereby starves attention of structural gradient. Weight decay must then erode memorization before the missing graph becomes learnable, yielding the known inverse-weight-decay delay, which we derive as a structural waiting time. We then prove that this explaining-away delay can be bypassed by a KL-based structural intervention, yielding an inverse-intervention-strength scaling law for the grokking time. Experiments on algorithmic sequence tasks isolate structure from capacity and show that this Bayesian ticket matches or outperforms lottery-ticket transfer.

摘要:為什麼一個已經記住其訓練集的Transformer會在進行泛化之前等待數千步?現有的解釋將這一延遲歸因於範數最小化、特徵出現或稀疏子網絡的晚期發現。這些解釋捕捉了轉變的重要部分,但忽略了一個對基於注意力的模型獨特的約束:如果注意力丟棄了一個有信息的標記,則沒有有界的下游計算可以恢復它。我們將注意力形式化為任務依賴圖上的隱式貝葉斯後驗,並證明泛化需要兩個可分的條件:對MLP容量的熟悉的金髮女孩界限,與基於範數的grokking理論相符,以及一個新的貝葉斯結構條件,要求注意力在每個有信息的標記上放置足夠的質量。這種解耦解釋了延遲泛化的原因是延遲的結構推斷。在訓練的早期,MLP通過不對齊的特徵進行記憶,將交叉熵損失推近於零,從而使注意力缺乏結構梯度。權重衰減必須在缺失的圖變得可學習之前侵蝕記憶,產生已知的逆權重衰減延遲,我們將其推導為結構等待時間。我們然後證明,這種解釋延遲可以通過基於KL的結構干預來繞過,從而產生grokking時間的逆干預強度縮放法則。在算法序列任務上的實驗將結構與容量隔離,並顯示這種貝葉斯票證匹配或超越了彩票票證轉移。

SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?

2605.15777v1 by Kean Shi, Zihang Li, Tianyi Ma, Zengji Tu, Jialong Wu, Xinbo Xu, Qingyao Yang, Ruoyu Wu, Weichu Xie, Ming Wu, Jason Zeng, Michael Heinrich, Elvis Zhang, Liang Chen, Kuan Li, Baobao Chang

Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more complex environments, such as web browsers and graphical user interfaces (GUIs). However, existing web and GUI agent benchmarks often rely on simplified settings, isolated tasks, or short-horizon interactions, making it difficult to assess the capabilities of agents in realistic professional workflows. Software-as-a-Service (SaaS) environments are a natural choice for CUA evaluation, as they host a large share of modern digital work and naturally involve dynamic system states, cross-application coordination, domain-specific knowledge, and long-horizon dependencies. To this end, we introduce SaaS-Bench, a benchmark built on 23 deployable SaaS systems across six professional domains, containing 106 tasks grounded in realistic work scenarios. These tasks require long-horizon execution, cover both text-only and multimodal settings, and are evaluated with weighted verification checkpoints that measure strict task completion and partial progress. Experiments show that representative LLM-based agents struggle on SaaS-Bench, with even the strongest model completing fewer than 4% of tasks end-to-end, exposing limitations in planning, state tracking, cross-application context maintenance, and error recovery. Code is available at https://github.com/UniPat-AI/SaaS-Bench for reproduction.

摘要:電腦使用代理(CUAs)正在迅速將大型語言模型(LLMs)從基於文本的推理擴展到在更複雜的環境中執行動作,例如網頁瀏覽器和圖形用戶界面(GUIs)。然而,現有的網頁和GUI代理基準通常依賴於簡化的設置、孤立的任務或短期交互,這使得在現實專業工作流程中評估代理的能力變得困難。軟體即服務(SaaS)環境是CUA評估的自然選擇,因為它們承載了現代數字工作的很大一部分,並自然涉及動態系統狀態、跨應用協調、領域特定知識和長期依賴性。為此,我們介紹了SaaS-Bench,這是一個基於六個專業領域中23個可部署SaaS系統構建的基準,包含106個基於現實工作場景的任務。這些任務需要長期執行,涵蓋文本和多模態設置,並通過加權驗證檢查點進行評估,這些檢查點測量嚴格的任務完成和部分進展。實驗顯示,代表性的基於LLM的代理在SaaS-Bench上表現不佳,即使是最強的模型也僅完成不到4%的任務,暴露了在規劃、狀態跟踪、跨應用上下文維護和錯誤恢復方面的局限性。代碼可在 https://github.com/UniPat-AI/SaaS-Bench 獲得以便重現。
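
The abstract above mentions weighted verification checkpoints that report both strict completion and partial progress. The following is a minimal sketch of how such a score could be computed; the `Checkpoint` class, the weights, and the checkpoint names are hypothetical and are not taken from the SaaS-Bench code.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    name: str       # hypothetical label for a verifiable state change
    weight: float   # relative importance assigned by the task author
    passed: bool    # outcome of the verifier for this checkpoint


def score_task(checkpoints):
    """Return (partial_progress, strict_completion) for one task."""
    total = sum(c.weight for c in checkpoints)
    if total <= 0:
        raise ValueError("checkpoint weights must sum to a positive value")
    earned = sum(c.weight for c in checkpoints if c.passed)
    partial_progress = earned / total          # weighted share of checkpoints passed
    strict_completion = all(c.passed for c in checkpoints)
    return partial_progress, strict_completion


# Hypothetical task with three checkpoints, one of which fails.
example = [
    Checkpoint("invoice created", 2.0, True),
    Checkpoint("customer notified", 1.0, True),
    Checkpoint("ledger updated", 1.0, False),
]
partial, strict = score_task(example)
print(f"partial progress = {partial:.2f}, strict completion = {strict}")
```

Reporting both numbers separates an agent that finishes most of a long workflow from one that fails at the first step, even though neither counts as an end-to-end success.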

ALSO: Adversarial Online Strategy Optimization for Social Agents

2605.15768v1 by Xiang Li, Liping Yi, Mingze Kong, Min Zhang, Zhongxiang Dai, QingHua Hu

Social simulation provides a compelling testbed for studying social intelligence, where agents interact through multi-turn dialogues under evolving contexts and strategically adapting opponents. Such environments are inherently non-stationary, requiring agents to dynamically adjust their strategies over time. However, most Large Language Model (LLM) based social agents rely on static personas, while existing approaches for enhancing social intelligence, such as offline reinforcement learning or external planners, are ill-suited to these settings, typically assuming stationarity and incurring substantial training overhead. To bridge this gap, we propose ALSO (Adversarial onLine Strategy Optimization), the first framework for online strategy optimization in multi-agent social simulation. ALSO advances social adaptation through two key contributions. (1) ALSO formulates multi-turn interaction as an adversarial bandit problem, where combinations of static personas and dynamic strategy instructions are treated as arms, providing a principled solution to non-stationarity without relying on environmental stability assumptions. (2) To generalize sparse feedback in multi-turn dialogues, ALSO introduces a lightweight neural surrogate that predicts rewards from interaction histories, enabling sample-efficient exploration and continuous online adaptation. Experiments on the Sotopia benchmark demonstrate that ALSO consistently outperforms static baselines and existing optimization methods in dynamic environments, validating the effectiveness of adversarial online strategy optimization for building robust social agents.

摘要:社會模擬提供了一個引人注目的測試平台,用於研究社會智能,其中代理人通過多輪對話在不斷變化的背景下互動並策略性地適應對手。這樣的環境本質上是非穩定的,要求代理人隨著時間動態調整其策略。然而,大多數基於大型語言模型(LLM)的社會代理人依賴於靜態角色,而現有的增強社會智能的方法,如離線強化學習或外部規劃者,並不適合這些設置,通常假設穩定性並產生大量的訓練開銷。為了填補這一空白,我們提出了\textbf{ALSO}(\textbf{A}dversarial on\textbf{L}ine \textbf{S}trategy \textbf{O}ptimization),這是第一個用於多代理社會模擬的在線策略優化框架。ALSO通過兩個關鍵貢獻推進社會適應。(1) ALSO將多輪互動形式化為一個對抗性賭徒問題,其中靜態角色和動態策略指令的組合被視為臂,提供了一種原則性解決方案,以應對非穩定性,而不依賴於環境穩定性假設。(2) 為了預測獎勵並在多輪對話中概括稀疏反饋,ALSO引入了一個輕量級神經替代模型,從互動歷史中預測獎勵,實現樣本高效的探索和持續的在線適應。在Sotopia基準上的實驗表明,ALSO在動態環境中始終超越靜態基準和現有優化方法,驗證了對抗性在線策略優化在構建穩健社會代理人方面的有效性。
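
To make the adversarial-bandit framing concrete, here is a minimal sketch that uses the classic EXP3 algorithm over arms formed from persona and strategy combinations. EXP3 is swapped in as a standard textbook algorithm; the abstract does not say which bandit algorithm ALSO uses, and the neural reward surrogate is not modeled here. The persona names, strategy names, and the `simulated_reward` function are all hypothetical.

```python
import itertools
import math
import random


class Exp3:
    """Textbook EXP3 for adversarial bandits with rewards in [0, 1]."""

    def __init__(self, n_arms, gamma=0.1, seed=0):
        self.n_arms = n_arms
        self.gamma = gamma                  # exploration rate
        self.weights = [1.0] * n_arms
        self.rng = random.Random(seed)

    def probabilities(self):
        total = sum(self.weights)
        return [(1 - self.gamma) * w / total + self.gamma / self.n_arms
                for w in self.weights]

    def select(self):
        probs = self.probabilities()
        arm = self.rng.choices(range(self.n_arms), weights=probs)[0]
        return arm, probs[arm]

    def update(self, arm, prob, reward):
        # Importance-weighted estimate keeps the reward estimate unbiased.
        estimate = reward / prob
        self.weights[arm] *= math.exp(self.gamma * estimate / self.n_arms)
        # Rescaling by the largest weight leaves the probabilities unchanged
        # and avoids overflow over long runs.
        top = max(self.weights)
        self.weights = [w / top for w in self.weights]


# Hypothetical arms: every (persona, strategy instruction) combination.
personas = ["cooperative", "assertive"]
strategies = ["concede early", "ask questions", "hold firm"]
arms = list(itertools.product(personas, strategies))


def simulated_reward(arm_index):
    # Toy stand-in for a dialogue outcome score in [0, 1].
    return 0.9 if arm_index == 4 else 0.2


bandit = Exp3(len(arms), gamma=0.1, seed=0)
for _ in range(2000):
    arm, prob = bandit.select()
    bandit.update(arm, prob, simulated_reward(arm))

probs = bandit.probabilities()
best = max(range(len(arms)), key=probs.__getitem__)
print(arms[best], round(probs[best], 3))
```

EXP3 makes no stationarity assumption about the reward sequence, which is the property the abstract highlights for opponents that adapt over time.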

GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions

2605.15764v1 by Junho Kim, Xu Cao, Houze Yang, Bikram Boote, Ana Jojic, Fiona Ryan, Bolin Lai, Sangmin Lee, James M. Rehg

Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multimodal large language models (MLLMs) often fail to identify who interacts with whom in multi-person videos. We introduce GRASP, a large-scale social reasoning dataset that connects high-level social QA with fine-grained gaze and deictic gesture events. GRASP contains 290K question-answer pairs over 46K videos totaling 749 hours, organized by a 16-category taxonomy spanning gaze, gesture, and joint gaze-gesture reasoning, together with GRASP-Bench for evaluation. Unlike prior resources that focus on either isolated cues or high-level social QA, GRASP builds questions from identity-consistent gaze trajectories, deictic gestures, and their joint compositions into social events. Moreover, we propose Social Grounding Reward (SGR), a learning signal that uses these social events to encourage models to reason about the participants involved in each interaction. Experiments show that SGR improves performance on GRASP-Bench while maintaining zero-shot performance on related social video QA benchmarks.

摘要:理解社交互動需要對微妙的非語言線索進行推理,但目前的多模態大型語言模型(MLLMs)往往無法識別多個人視頻中誰與誰互動。我們介紹了GRASP,一個大規模社交推理數據集,將高層次的社交問答與細緻的凝視和指示性手勢事件相連接。GRASP包含290K個問題--答案對,涵蓋46K個視頻,總計749小時,按照涵蓋凝視、手勢和聯合凝視--手勢推理的16類分類法進行組織,並配有GRASP-Bench進行評估。與之前專注於孤立線索或高層次社交問答的資源不同,GRASP從身份一致的凝視軌跡、指示性手勢及其在社交事件中的聯合組合中構建問題。此外,我們提出了社交基礎獎勵(SGR),這是一種學習信號,利用這些社交事件來鼓勵模型推理每次互動中參與者的情況。實驗表明,SGR在GRASP-Bench上提升了性能,同時在相關的社交視頻問答基準上保持了零樣本性能。

CompactQE: Interpretable Translation Quality Estimation via Small Open-Weight LLMs

2605.15763v1 by Kamil Guttmann, Zofia Fraś, Artur Nowakowski, Krzysztof Jassem

Current state-of-the-art Quality Estimation (QE) in machine translation relies on massive, proprietary LLMs, raising data privacy concerns. We demonstrate that smaller, open-source LLMs (<30B parameters) are a viable, cost-effective and privacy-preserving alternative. Using a single-pass prompting strategy, our models simultaneously generate quality scores, MQM error annotations, suggested error corrections, and full post-editions. Our analysis shows these models achieve highly competitive system-level correlations with human judgments that outperform traditional neural metrics, fine-tuned models, and human inter-annotator agreement, effectively approximating the capabilities of much larger proprietary LLMs.

摘要:目前最先進的機器翻譯品質評估(QE)依賴於龐大的專有大型語言模型(LLMs),這引發了數據隱私的擔憂。我們證明了較小的開源大型語言模型(<30B 參數)是一種可行的、具成本效益且能保護隱私的替代方案。使用單次提示策略,我們的模型同時生成品質分數、MQM 錯誤註釋、建議的錯誤修正以及完整的後編輯。我們的分析顯示這些模型在系統級別上與人類判斷的相關性非常競爭,超越了傳統神經度量、微調模型和人類標註者之間的一致性,有效地接近了更大專有大型語言模型的能力。

DimMem: Dimensional Structuring for Efficient Long-Term Agent Memory

2605.15759v1 by Wentao Qiu, Haotian Hu, Fanyi Wang, Jinwei Kong, Yu Zhang

Large language model (LLM) agents require long-term memory to leverage information from past interactions. However, existing memory systems often face a fidelity-efficiency trade-off: raw dialogue histories are expensive, while flat facts or summaries may discard the structure needed for precise recall. We propose DimMem, a lightweight dimensional memory framework that represents each memory as an atomic, typed, and self-contained unit with explicit fields such as time, location, reason, purpose, and keywords. This representation exposes the structure needed for dimension-aware retrieval, memory update, and selective assistant-context recall without storing full histories in the model context. Across LoCoMo-10 and LongMemEval-S, DimMem achieves 81.43% and 78.20% overall accuracy, respectively, outperforming existing lightweight memory systems while reducing LoCoMo per-query token cost by 24%. We further show that dimensional memory extraction is learnable by compact models: after fine-tuning on the DimMem schema, a Qwen3-4B extractor surpasses LightMem with GPT-4.1-mini on both benchmarks and reaches performance comparable to, or better than, much larger extractors in key settings. These results suggest that explicit dimensional structuring is an effective and efficient foundation for long-term memory in LLM agents. Code is available at https://github.com/ChowRunFa/DimMem.

摘要:大型語言模型(LLM)代理需要長期記憶來利用過去互動中的信息。然而,現有的記憶系統往往面臨忠實度與效率的權衡:原始對話歷史成本高昂,而平面事實或摘要可能會丟失精確回憶所需的結構。我們提出了\textbf{DimMem},一種輕量級的維度記憶框架,將每個記憶表示為一個原子、類型化且自包含的單元,具有明確的字段,例如時間、地點、原因、目的和關鍵詞。這種表示法揭示了進行維度感知檢索、記憶更新和選擇性助手上下文回憶所需的結構,而無需在模型上下文中存儲完整歷史記錄。在LoCoMo-10和LongMemEval-S上,DimMem分別達到\textbf{81.43\%}和\textbf{78.20\%}的整體準確率,超越現有的輕量級記憶系統,同時將LoCoMo每查詢的標記成本降低了\textbf{24\%}。我們進一步顯示,維度記憶提取是可通過緊湊模型學習的:在DimMem架構上進行微調後,Qwen3-4B提取器在兩個基準上超越了LightMem與GPT-4.1-mini,並在關鍵設置中達到與更大提取器可比或更好的性能。這些結果表明,明確的維度結構是一種有效且高效的長期記憶基礎,適用於LLM代理。代碼可在https://github.com/ChowRunFa/DimMem獲得。
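
The typed memory unit described above can be pictured as a small record with explicit fields. The sketch below is an illustration only: the `MemoryUnit` class, the `retrieve` helper, and the matching rules are assumptions made for this example and do not reflect the actual DimMem schema or retrieval logic.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryUnit:
    text: str                       # self-contained statement of the memory
    time: str = ""
    location: str = ""
    reason: str = ""
    purpose: str = ""
    keywords: list = field(default_factory=list)


def matches(unit, name, wanted):
    """Check one dimension: exact keyword match, substring match otherwise."""
    value = getattr(unit, name)
    if name == "keywords":
        return any(wanted.lower() == k.lower() for k in value)
    return wanted.lower() in value.lower()


def retrieve(memories, **filters):
    """Return the units that satisfy every requested dimension filter."""
    return [m for m in memories
            if all(matches(m, name, wanted) for name, wanted in filters.items())]


# Hypothetical memories extracted from past dialogue turns.
store = [
    MemoryUnit("User booked a dentist appointment", time="2024-03-02",
               location="Berlin", purpose="check-up", keywords=["health"]),
    MemoryUnit("User moved the team offsite", time="2024-03-09",
               location="Lisbon", reason="venue conflict", keywords=["work"]),
]
hits = retrieve(store, location="lisbon", keywords="work")
print([m.text for m in hits])
```

Filtering on named fields lets a query such as "what happened in Lisbon" touch only the location dimension instead of scanning raw dialogue history.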

BiomedAP: A Vision-Informed Dual-Anchor Framework with Gated Cross-Modal Fusion for Robust Medical Vision-Language Adaptation

2605.15736v1 by Huanyang Tong, Kai Liu, Fangjun Kuang, Huiling Chen

Biomedical Vision-Language Models (VLMs) have shown remarkable promise in few-shot medical diagnosis but face a critical bottleneck: fragility to prompt variations. Existing adaptation frameworks typically optimize visual and textual prompts as independent streams, relying on ideal "Golden Prompts". In clinical reality, where descriptions are often noisy and heterogeneous, this modality isolation leads to unstable cross-modal alignment. To address this, we propose BiomedAP, a vision-informed dual-anchor framework with gated cross-modal fusion. BiomedAP enforces synergistic alignment through two mechanisms: (1) Gated Cross-Modal Fusion, which enables layer-wise interaction between modalities, acting as a dynamic noise regulator to suppress irrelevant textual cues; and (2) a Dual-Anchor Constraint that regularizes learnable prompts toward stable semantic centroids derived from both expert templates (High Anchors) and few-shot visual prototypes (Low Anchors). Extensive experiments across 11 benchmarks demonstrate that BiomedAP consistently surpasses baselines, achieving competitive few-shot accuracy and markedly enhanced robustness under prompt perturbations. Our code is available at: https://github.com/tongdiedie/BiomedAP. Keywords: Vision-Language Models; Prompt Learning; Parameter-Efficient Fine-Tuning; Few-shot Learning

摘要:生物醫學視覺-語言模型(VLMs)在少量醫療診斷中顯示出顯著的潛力,但面臨一個關鍵瓶頸:\textit{對提示變化的脆弱性}。現有的適應框架通常將視覺和文本提示作為獨立的流進行優化,依賴於理想的“黃金提示”。在臨床現實中,描述往往是嘈雜且異質的,這種模態隔離導致跨模態對齊的不穩定性。為了解決這個問題,我們提出了BiomedAP,一種具有門控跨模態融合的視覺知情雙錨框架。BiomedAP通過兩個機制強化協同對齊:(1)門控跨模態融合,使模態之間的層級互動成為可能,充當動態噪聲調節器以抑制不相關的文本提示;(2)雙錨約束,將可學習的提示正則化為來自專家模板(高錨)和少量視覺原型(低錨)的穩定語義中心。在11個基準上的廣泛實驗表明,BiomedAP始終超越基準,實現了具有競爭力的少量準確性,並在提示擾動下顯著增強了穩健性。我們的代碼可在以下網址獲得:https://github.com/tongdiedie/BiomedAP。關鍵詞:視覺-語言模型;提示學習;參數高效微調;少量學習

UAM: A Dual-Stream Perspective on Forgetting in VLA Training

2605.15735v1 by Jianke Zhang, Yuanfei Luo, Yucheng Hu, Xiaoyu Chen, Yanjiang Guo, Ziyang Liu, Hongbin Xu, Tian Lan, Jianyu Chen

Vision-language-action (VLA) models are typically built by fine-tuning a pretrained vision-language model (VLM) on action data. However, we show that this standard recipe systematically erodes the VLM's multimodal competence, a side effect we call the embodiment tax. But do VLAs have to forget? Inspired by the two-stream organization of biological vision, we trace this degradation to a structural bottleneck: current VLAs ask a single encoder to support both language-grounded semantics and control-relevant visual features, whereas biological vision separates recognition and visuomotor control into distinct pathways. Building on this view, we propose the Unified Action Model (UAM), which adds a parallel Dorsal Expert, an analog of the brain's dorsal pathway. To make the Dorsal Expert an effective second pathway and reduce the control-learning burden on the VLM, we initialize it from a pretrained generative model and train it with a mid-level reasoning objective that predicts visual dynamics. This design allows us to train the whole VLA end-to-end on action data alone: with no parameter freezing, no gradient stopping, and no auxiliary VL co-training, UAM retains over 95% of the underlying VLM's multimodal capability and at the same time achieves the highest average success rate among baselines on a variety of manipulation tasks that probe out-of-distribution generalization, including unseen objects, novel object-target compositions, and instruction variation. Together, these results suggest that semantic preservation in VLAs can emerge from architectural separation itself, rather than being enforced by frozen weights or auxiliary data replay, and that this preserved semantic capability can naturally transfer from VLMs to semantic generalization in actions.

摘要:視覺--語言--行動 (VLA) 模型通常是通過在行動數據上微調預訓練的視覺--語言模型 (VLM) 來構建的。然而,我們顯示這一標準方法系統性地侵蝕了 VLM 的多模態能力,我們稱這一副作用為具身稅。但是 VLA 必須忘記嗎?受到生物視覺的雙流組織啟發,我們將這一退化追溯到一個結構瓶頸:當前的 VLA 要求單一編碼器同時支持語言基礎的語義和控制相關的視覺特徵,而生物視覺則將識別和視動控制分為不同的通路。基於這一觀點,我們提出了統一行動模型 (UAM),它增加了一個平行的背側專家,這是大腦背側通路的類比。為了使背側專家成為有效的第二通路並減少對 VLM 的控制學習負擔,我們從預訓練的生成模型初始化它,並以一個中階推理目標進行訓練,該目標預測視覺動態。這一設計使我們能夠僅在行動數據上端到端地訓練整個 VLA:在不凍結參數、不停止梯度和不進行輔助 VL 共同訓練的情況下,UAM 保留了超過 $95\%$ 的基礎 VLM 的多模態能力,同時在各種操作任務的基準中實現了最高的平均成功率,這些任務探測了分佈外的泛化,包括未見物體、新物體--目標組合和指令變化。綜合這些結果表明,VLA 中的語義保留可以源自於架構的分離本身,而不是通過凍結權重或輔助數據重播來強制實現,並且這一保留的語義能力可以自然地從 VLM 轉移到行動中的語義泛化。

Can We Trust AI-Inferred User States. A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments

2605.15734v1 by Izabella Krzeminska, Michal Butkiewicz, Ewa Komkowska

The use of large language models to assess user states in conversational and adaptive systems is based on the assumption that the metrics used for such assessment are stable and interpretable at the level of individual scores. This paper empirically tests this assumption, focusing on the psychometric reliability of artificial intelligence (AI) measures of user states. This study employed replication evaluation procedures to assess the repeatability of a broad set of metrics across three different bimodal large language models (GPT-4o audio, Gemini 2.0 Flash, Gemini 2.5 Flash). Analyses include both individual score reliability and aggregated reliability, allowing us to distinguish metrics potentially useful for real-time adaptation from those that retain their value only in aggregated analyses. The results demonstrate that metric reliability cannot be considered a default property in interpretive domains. The lack of stability at the level of individual scores precludes the interpretation of such scores as indicators of user state in real-time adaptive systems, even if these metrics demonstrate stability after aggregation. At the same time, the study indicates that individually unstable metrics can retain analytical utility in post-hoc studies, identifying rules governing interactions and their relationships with user experience parameters such as satisfaction, trust, and engagement. The main contribution of this work, besides quantifying the severity of the problem (only 31 of 213 metrics met the criteria), is the proposal of a replicable evaluation framework, enabling measurable evaluations of metric applicability. This approach supports more responsible AI design of adaptive systems, in which the interpretation of results requires explicit validation of reliability and monitoring for violations over time.

摘要:使用大型語言模型來評估對話和自適應系統中的用戶狀態是基於這樣的假設:用於此類評估的指標在個別分數層面上是穩定且可解釋的。本文實證測試了這一假設,重點關注人工智能(AI)在用戶狀態測量中的心理測量可靠性。 本研究採用了重複評估程序,以評估三種不同的雙模態大型語言模型(GPT-4o音頻、Gemini 2.0 Flash、Gemini 2.5 Flash)中一組廣泛指標的重複性。分析包括個別分數的可靠性和聚合可靠性,使我們能夠區分對於實時適應可能有用的指標和僅在聚合分析中保留其價值的指標。 結果顯示,指標可靠性不能被視為解釋領域中的默認屬性。在個別分數層面缺乏穩定性使得無法將這些分數解釋為實時自適應系統中用戶狀態的指標,即使這些指標在聚合後顯示出穩定性。同時,研究表明,個別不穩定的指標在事後研究中仍然可以保留分析效用,識別規範互動的規則及其與用戶體驗參數(如滿意度、信任和參與度)的關係。 這項工作的主要貢獻,除了量化問題的嚴重性(只有213個指標中有31個符合標準)外,還提出了一個可重複的評估框架,使指標適用性的可測評估成為可能。這種方法支持更負責任的自適應系統AI設計,其中結果的解釋需要明確的可靠性驗證和隨時間的違規監控。
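
The distinction between individual-score reliability and aggregated reliability can be illustrated with the one-way intraclass correlation, a standard psychometric statistic. The abstract does not say which statistic the authors use, so this is a swapped-in example; the score matrix is made up for illustration.

```python
def icc_oneway(scores):
    """One-way random-effects ICC for a table of repeated ratings.

    scores: list of rows, one row per rated item, each holding the k
    replicated scores the model produced for that item.
    Returns (ICC(1,1), ICC(1,k)): reliability of a single score and of
    the mean over the k replications.
    """
    n = len(scores)
    k = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    # Between-item and within-item mean squares from a one-way ANOVA.
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msw = sum((x - m) ** 2
              for row, m in zip(scores, row_means)
              for x in row) / (n * (k - 1))
    icc_single = (msb - msw) / (msb + (k - 1) * msw)
    icc_average = (msb - msw) / msb
    return icc_single, icc_average


# Hypothetical metric: four conversations, each scored twice by the same model.
ratings = [[1, 2], [3, 5], [6, 6], [8, 9]]
single, average = icc_oneway(ratings)
print(round(single, 3), round(average, 3))
```

Whenever the between-item variance exceeds the within-item variance, the averaged score is more reliable than a single score, which mirrors the abstract's point that a metric can be too unstable for real-time use yet still useful in aggregate.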

Structure Abstraction and Generalization in a Hippocampal-Entorhinal Inspired World Model

2605.15733v1 by Tianqiu Zhang, Muyang Lyu, Xiao Liu, Si Wu

Humans abstract experiences into structured representations to facilitate pattern inference and knowledge transfer. While the hippocampal-entorhinal (HPC-MEC) circuit is known to represent both spatial and conceptual spaces, the mechanisms for concurrently extracting abstract structures from continuous, high-dimensional dynamics remain poorly understood. We propose a brain-inspired hierarchical model that simultaneously infers latent transitions and constructs a predictive visual world model. Our architecture employs an inverse model for structural extraction alongside an HPC-MEC coupling model that dissociates relational structures (MEC) from integrated episodic scenes (HPC). Using primitive transformation dynamics as a benchmark, we demonstrate the model's capacity for structural abstraction. By leveraging velocity-driven path integration, the framework enables robust prediction and structural reuse across diverse contexts, thereby achieving structural generalization. This work provides a novel computational framework for understanding how brain-inspired, self-supervised learning of world models facilitates the acquisition of reusable abstract knowledge.

摘要:人類將經驗抽象為結構化的表徵,以促進模式推斷和知識轉移。雖然海馬-內嗅皮層(HPC-MEC)電路已知能夠表徵空間和概念空間,但同時從連續的高維動態中提取抽象結構的機制仍然不甚了解。我們提出了一種受大腦啟發的階層模型,該模型同時推斷潛在轉變並構建預測視覺世界模型。我們的架構使用逆模型進行結構提取,並結合一個HPC-MEC耦合模型,將關聯結構(MEC)與整合的情節場景(HPC)分離。使用原始轉換動態作為基準,我們展示了該模型的結構抽象能力。通過利用速度驅動的路徑整合,該框架實現了在多樣化上下文中的穩健預測和結構重用,從而達成結構泛化。這項工作提供了一個新穎的計算框架,以理解受大腦啟發的自我監督學習如何促進可重用抽象知識的獲取。

DecomPose: Disentangling Cross-Category Optimization Contention for Category-Level 6D Object Pose Estimation

2605.15728v1 by Yifan Gao, Lu Zou, Zhangjin Huang, Guoping Wang

Category-level 6D object pose estimation is typically formulated as a multi-category joint learning problem with fully shared model parameters. However, pronounced geometric heterogeneity across categories entangles incompatible optimization signals in shared modules, resulting in gradient conflicts and negative transfer during training. To address this challenge, we first introduce gradient-based diagnostics to quantify module-level cross-category contention. Building on the results of these diagnostics, we propose DecomPose, a difficulty-aware decomposition framework that mitigates optimization contention via: (1) difficulty-aware gradient decoupling, which groups categories using a data-driven difficulty proxy and routes each instance to a group-specific correspondence branch to isolate incompatible updates; and (2) stability-driven asymmetric branching, which assigns higher-capacity branches to structurally simple categories as stable optimization anchors while constraining complex categories with lightweight branches to suppress noisy updates and alleviate negative transfer. Extensive experiments on REAL275, CAMERA25, and HouseCat6D demonstrate that DecomPose effectively reduces cross-category optimization contention and delivers superior pose estimation performance across multiple benchmarks.

摘要:類別級 6D 物體姿態估計通常被表述為一個多類別的聯合學習問題,並且模型參數完全共享。然而,類別之間顯著的幾何異質性使得共享模塊中出現不相容的優化信號,導致訓練過程中的梯度衝突和負轉移。為了解決這一挑戰,我們首先引入基於梯度的診斷來量化模塊級別的跨類別競爭。基於診斷結果,我們提出了 DecomPose,一個具備難度感知的分解框架,通過以下方式減輕優化競爭:(1) 難度感知的梯度解耦,使用數據驅動的難度代理將類別分組,並將每個實例路由到特定組別的對應分支以隔離不相容的更新;(2) 稳定性驅動的非對稱分支,將更高容量的分支分配給結構簡單的類別作為穩定的優化錨點,同時用輕量級的分支約束複雜類別,以抑制噪聲更新並減輕負轉移。在 REAL275、CAMERA25 和 HouseCat6D 上進行的廣泛實驗表明,DecomPose 有效減少了跨類別的優化競爭,並在多個基準上提供了卓越的姿態估計性能。
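
One common way to quantify gradient conflict in a shared module is the pairwise cosine similarity between per-category gradients, with negative values indicating opposing updates. The sketch below assumes that form of diagnostic; the abstract does not specify the exact metric, and the gradient vectors are invented for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def contention(grads):
    """grads maps a category name to the flattened gradient of one shared module.

    Returns the pairwise cosine similarities and the fraction of category
    pairs whose gradients point in conflicting directions (cosine < 0).
    """
    pairs = {(a, b): cosine(grads[a], grads[b])
             for a, b in combinations(sorted(grads), 2)}
    conflict_rate = sum(1 for c in pairs.values() if c < 0) / len(pairs)
    return pairs, conflict_rate


# Hypothetical per-category gradients for a single shared layer.
grads = {
    "bottle": [1.0, 0.5, 0.0],
    "bowl": [0.9, 0.4, 0.1],
    "camera": [-1.0, 0.2, 0.3],
}
pairs, rate = contention(grads)
print({k: round(v, 2) for k, v in pairs.items()}, round(rate, 2))
```

Running such a diagnostic per module shows where sharing hurts most, which is the kind of evidence that would motivate routing conflicting categories to separate branches.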

Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

2605.15726v1 by Chanuk Lee, Sangwoo Park, Minki Kang, Sung Ju Hwang

Reinforcement learning with verifiable rewards (RLVR) has emerged as a scalable paradigm for improving the reasoning capabilities of large language models. However, its effectiveness is fundamentally limited by exploration: the policy can only improve on trajectories it has already sampled. While increasing the number of rollouts alleviates this issue, such brute-force scaling is computationally expensive, and existing approaches that modify the optimization objective provide limited control over what is explored. In this work, we propose NudgeRL, a framework for structured and diversity-driven exploration in RLVR. Our approach introduces Strategy Nudging, which conditions each rollout on lightweight, strategy-level contexts to induce diverse reasoning trajectories without relying on expensive oracle supervision. To effectively learn from such structured exploration, we further propose a unified objective, which decomposes the reward signal into inter- and intra-context components and incorporates a distillation objective to transfer discovered behaviors back to the base policy. Empirically, NudgeRL outperforms standard GRPO with up to 8 times larger rollout budgets, while also outperforming the oracle-guided RL baseline on average across five challenging math benchmarks. These results demonstrate that structured, context-driven exploration can serve as an efficient and scalable alternative to both brute-force rollout scaling and feasibility-oriented methods based on privileged information. Our code is available at https://github.com/tally0818/NudgeRL.

摘要:強化學習與可驗證獎勵(RLVR)已成為提升大型語言模型推理能力的可擴展範式。然而,其有效性基本上受到探索的限制:策略只能在已經採樣的軌跡上進行改進。雖然增加回合次數可以緩解這個問題,但這種粗暴的擴展在計算上是昂貴的,而現有的修改優化目標的方法對於探索的控制有限。在這項工作中,我們提出了 NudgeRL,這是一個在 RLVR 中進行結構化和多樣性驅動探索的框架。我們的方法引入了策略推動,這使得每個回合都基於輕量級的策略級上下文來誘導多樣的推理軌跡,而不依賴於昂貴的預言機監督。為了有效地從這種結構化探索中學習,我們進一步提出了一個統一目標,將獎勵信號分解為跨上下文和內部上下文組件,並納入一個蒸餾目標以將發現的行為轉移回基礎策略。實證結果表明,NudgeRL 在回合預算高達 8 倍的情況下超越了標準 GRPO,同時在五個具有挑戰性的數學基準上平均超越了預言機引導的 RL 基線。這些結果表明,結構化的、以上下文為驅動的探索可以作為粗暴回合擴展和基於特權信息的可行性導向方法的有效且可擴展的替代方案。我們的代碼可在 https://github.com/tally0818/NudgeRL 獲得。
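
The split of a reward signal into inter- and intra-context components can be illustrated with a simple mean-centering decomposition. This is only one plausible reading, written as an assumption: the abstract does not give the actual NudgeRL objective, and the strategy names and rewards below are invented.

```python
def decompose_advantages(rewards_by_context):
    """Split each rollout's centered reward into two additive parts.

    rewards_by_context maps a strategy context to the rewards of the
    rollouts sampled under it. For every rollout:
      inter = context mean - global mean   (how good the strategy is)
      intra = reward - context mean        (how good this rollout is
                                            relative to its strategy)
    so that inter + intra == reward - global mean.
    """
    all_rewards = [r for rs in rewards_by_context.values() for r in rs]
    global_mean = sum(all_rewards) / len(all_rewards)
    parts = {}
    for context, rewards in rewards_by_context.items():
        context_mean = sum(rewards) / len(rewards)
        inter = context_mean - global_mean
        parts[context] = [(inter, r - context_mean) for r in rewards]
    return global_mean, parts


# Hypothetical verifiable rewards (1 = correct answer) under two strategy nudges.
rollouts = {
    "work backwards": [1.0, 1.0, 0.0, 1.0],
    "try small cases": [0.0, 0.0, 1.0, 0.0],
}
global_mean, parts = decompose_advantages(rollouts)
for context, pairs in parts.items():
    print(context, pairs)
```

Keeping the two terms separate lets a learner credit a useful strategy as a whole while still ranking individual rollouts within that strategy.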

DiLA: Disentangled Latent Action World Models

2605.15725v1 by Tianqiu Zhang, Muyang Lyu, Yufan Zhang, Fang Fang, Si Wu

Latent Action Models (LAMs) enable the learning of world models from unlabeled video by inferring abstract actions between consecutive frames. However, LAMs face a fundamental trade-off between action abstraction and generation fidelity. Existing methods typically circumvent this issue by using two-stage training with pre-trained world models or by limiting predictions to optical flow. In this paper, we introduce DiLA, a novel Disentangled Latent Action world model that aims to resolve this trade-off via content-structure disentanglement. Our key insight is that disentanglement and latent action learning are co-evolving: the predictive bottleneck inherent in latent action learning serves as a driving force for disentanglement, compelling the model to distill spatial layouts into the structure pathway while offloading visual details to a separate content pathway for generation. This synergy yields a continuous, semantically structured latent action space without compromising generative quality. DiLA achieves superior results in video generation quality, action transfer, visual planning, and manifold interpretability. These findings establish DiLA as a unified framework that simultaneously achieves high-level action abstraction and high-fidelity generation, advancing the frontier of self-supervised world model learning.

摘要:潛在行動模型(LAMs)通過推斷連續幀之間的抽象行動,使得從未標記視頻中學習世界模型成為可能。然而,LAMs面臨著行動抽象與生成真實性之間的基本權衡。現有的方法通常通過使用兩階段訓練與預訓練的世界模型,或通過將預測限制在光流上來迴避這個問題。在本文中,我們介紹了DiLA,一種新穎的解耦潛在行動世界模型,旨在通過內容-結構解耦來解決這一權衡。我們的關鍵見解是,解耦與潛在行動學習是共同演進的:潛在行動學習中固有的預測瓶頸成為了解耦的驅動力,迫使模型將空間佈局提煉到結構通道,同時將視覺細節卸載到單獨的內容通道以進行生成。這種協同作用產生了一個連續的、語義結構化的潛在行動空間,而不妥協生成質量。DiLA在視頻生成質量、行動轉移、視覺規劃和多樣性可解釋性方面取得了優越的結果。這些發現確立了DiLA作為一個統一框架,同時實現高級行動抽象和高保真生成,推進了自監督世界模型學習的前沿。

Bidirectional Fusion Guided by Cardiac Patterns for Semi-Supervised ECG Segmentation

2605.15722v1 by Jeonghwa Lim, Minje Park, Sunghoon Joo

Accurate electrocardiogram (ECG) delineation, the segmentation of meaningful waveform features, is crucial for cardiovascular diagnostics. However, the scarcity of annotated data poses a significant challenge for training deep learning models. Conventional semi-supervised semantic segmentation (SemiSeg) methods primarily focus on consistency from unlabeled data, underutilizing the information exchange possible between labeled and unlabeled sets. To address this, we introduce CardioMix, a framework built on a bidirectional CutMix strategy guided by cardiac patterns for ECG segmentation. This approach enriches the labeled set with realistic variations from unlabeled data while simultaneously applying stronger supervisory signals to the unlabeled set, as the cardiac pattern-guided mixing ensures all augmented samples remain physiologically meaningful. Our framework is designed as a plug-and-play module, demonstrating high compatibility with various SemiSeg algorithms. Extensive experiments on SemiSegECG, a public multi-dataset benchmark for ECG delineation, demonstrate that CardioMix consistently outperforms existing CutMix-based fusion strategies across diverse datasets and labeled ratios.

摘要:準確地劃分心電圖 (ECG),即有意義波形特徵的分割,對於心血管診斷至關重要。然而,標註數據的稀缺對訓練深度學習模型構成了重大挑戰。傳統的半監督語義分割 (SemiSeg) 方法主要集中在來自未標註數據的一致性,未充分利用標註與未標註數據集之間可能的信息交換。為了解決這個問題,我們提出了 CardioMix,一個基於雙向 CutMix 策略的框架,該策略由心臟模式指導以進行 ECG 分割。這種方法通過從未標註數據中引入現實變化來豐富標註數據集,同時對未標註數據集施加更強的監督信號,因為心臟模式指導的混合確保所有增強樣本保持生理上有意義。我們的框架設計為即插即用模塊,顯示出與各種 SemiSeg 算法的高度兼容性。在 SemiSegECG 上進行的廣泛實驗,這是一個公共多數據集基準,用於 ECG 劃分,證明 CardioMix 在各種數據集和標註比例上始終優於現有的基於 CutMix 的融合策略,作為與各種 SemiSeg 算法兼容的即插即用模塊。
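
A bidirectional CutMix on one-dimensional signals can be sketched as swapping the same time window between a labeled and an unlabeled recording, together with the corresponding masks. The sketch assumes that the window boundaries are supplied by some cardiac-pattern detector (for example, beat boundaries); how CardioMix actually chooses them is not stated in the abstract, and the label encoding is hypothetical.

```python
def bidirectional_cutmix(sig_l, mask_l, sig_u, pseudo_u, start, end):
    """Swap the window [start, end) between a labeled and an unlabeled signal.

    sig_l / mask_l: labeled samples and their ground-truth segment labels.
    sig_u / pseudo_u: unlabeled samples and their pseudo-labels.
    Returns the two mixed (signal, mask) pairs, one per direction.
    """
    if len(sig_l) != len(sig_u) or not 0 <= start < end <= len(sig_l):
        raise ValueError("signals must share a length and the window must fit")
    # Labeled recording receives a window from the unlabeled one.
    mixed_l = (sig_l[:start] + sig_u[start:end] + sig_l[end:],
               mask_l[:start] + pseudo_u[start:end] + mask_l[end:])
    # Unlabeled recording receives the same window from the labeled one.
    mixed_u = (sig_u[:start] + sig_l[start:end] + sig_u[end:],
               pseudo_u[:start] + mask_l[start:end] + pseudo_u[end:])
    return mixed_l, mixed_u


# Toy signals; labels use a hypothetical encoding (0 = baseline, 2 = QRS).
sig_l = [0.0, 0.1, 1.0, 0.1, 0.0, 0.0]
mask_l = [0, 0, 2, 0, 0, 0]
sig_u = [0.0, 0.0, 0.2, 0.9, 0.2, 0.0]
pseudo_u = [0, 0, 0, 2, 0, 0]
(mix_l_sig, mix_l_mask), (mix_u_sig, mix_u_mask) = bidirectional_cutmix(
    sig_l, mask_l, sig_u, pseudo_u, start=2, end=4)
print(mix_l_sig, mix_l_mask)
print(mix_u_sig, mix_u_mask)
```

Mixing in both directions means the unlabeled sample inherits a window with ground-truth labels, which is one way to read the abstract's "stronger supervisory signals".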

Contexting as Recommendation: Evolutionary Collaborative Filtering for Context Engineering

2605.15721v1 by Jiachen Zhu, Zhuoying Ou, Congmin Zheng, Yuxiang Chen, Zeyu Zheng, Rong Shan, Lingyu Yang, Lionel Z. Wang, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin

Large Language Models (LLMs) are highly sensitive to their input contexts, motivating the development of automated context engineering. However, existing methods predominantly treat this as a global search problem, seeking a single context strategy that maximizes average performance across a dataset. This restrictive assumption overlooks the fact that different inputs often require distinct guidance, leaving substantial instance-level performance gains untapped. In this paper, we propose a paradigm shift by formulating context engineering as a recommendation problem. We introduce Neural Collaborative Context Engineering (NCCE), a framework that transitions optimization from a static global search to dynamic, instance-wise routing. NCCE first bootstraps a diverse catalog of anchor contexts and then employs a novel Context-CF Co-Evolution mechanism. This stage establishes a synergistic feedback loop: a lightweight Neural Collaborative Filtering (NCF) model learns instance-context preferences to guide the generation of specialized context variants, while the newly evaluated contexts continuously refine the NCF model's understanding of latent preferences. At inference time, the trained NCF model acts as a context router, dynamically assigning the most suitable context strategy to each unseen instance. Theoretical proofs and comprehensive experiments demonstrate that by matching individual inputs with their optimal contexts, NCCE significantly improves task accuracy, highlighting the critical importance of personalization in LLM context engineering.

摘要:大型語言模型(LLMs)對其輸入上下文高度敏感,這促進了自動化上下文工程的發展。然而,現有的方法主要將其視為一個全局搜索問題,尋求一個能最大化數據集平均性能的單一上下文策略。這一限制性假設忽視了不同輸入往往需要不同指導的事實,導致大量實例級別的性能提升未被挖掘。在本文中,我們通過將上下文工程形式化為推薦問題來提出一種範式轉變。我們介紹了\textbf{神經協作上下文工程(NCCE)},這是一個將優化從靜態全局搜索轉變為動態實例路由的框架。NCCE首先啟動一個多樣化的錨點上下文目錄,然後採用一種新穎的\textbf{上下文-CF共同進化}機制。這一階段建立了一個協同反饋循環:一個輕量級的神經協作過濾(NCF)模型學習實例-上下文偏好,以指導專門上下文變體的生成,而新評估的上下文則不斷完善NCF模型對潛在偏好的理解。在推理時,訓練好的NCF模型充當上下文路由器,動態地將最適合的上下文策略分配給每個未見實例。理論證明和全面實驗表明,通過將個別輸入與其最佳上下文匹配,NCCE顯著提高了任務準確性,突顯了個性化在LLM上下文工程中的關鍵重要性。
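
Instance-wise routing can be illustrated with the simplest collaborative-filtering scorer, a dot product between an instance embedding and each context embedding. This is a deliberately reduced stand-in: NCF models normally add learned nonlinear layers, and the embeddings, context names, and dimensions below are all hypothetical.

```python
def score(instance_vec, context_vec):
    """Predicted affinity between one input and one context strategy."""
    return sum(a * b for a, b in zip(instance_vec, context_vec))


def route(instance_vec, context_vecs):
    """Pick the context whose embedding scores highest for this instance."""
    return max(context_vecs, key=lambda name: score(instance_vec, context_vecs[name]))


# Hypothetical learned embeddings; dimensions might loosely encode
# (needs arithmetic, needs retrieval, needs step-by-step reasoning).
contexts = {
    "show worked arithmetic": [0.9, 0.0, 0.3],
    "quote supporting passages": [0.0, 0.9, 0.1],
    "plan before answering": [0.2, 0.1, 0.9],
}
math_question = [0.8, 0.1, 0.4]
lookup_question = [0.1, 0.9, 0.2]

print(route(math_question, contexts))
print(route(lookup_question, contexts))
```

A global search would return one of these contexts for every input, while a router of this kind can assign different contexts to the two questions.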

Position: Early-Stage Quality Assurance in Annotation Pipelines Is More Cost-Effective Than Late-Stage Validation

2605.15714v1 by Sunil Kothari, Sumukha Sharma Thoppanahalli Chandramouli, Naman Khandelwal, Parth Kulshreshtha, Ashi Jain, Kriti Banka, Tanuja Chintada, Venkata Triveni, Gulipalli Praveen Kumar, Manish Mehta, Tao Liu

This position paper argues that the machine learning community should prioritize early-stage quality assurance in annotation pipelines over the prevailing practice of late-stage validation. Data quality bottlenecks increasingly limit foundation model improvement, yet quality assurance research focuses almost exclusively on validation methods rather than validation timing. When validation occurs, not merely what methods are employed, fundamentally determines both error rates and annotation costs. This temporal neglect is puzzling given the well-established "shift-left" principle from software engineering, where empirical studies demonstrate 4-100x cost multipliers for defects detected in later stages (Boehm, 1981; Shull et al., 2002). Annotation pipelines exhibit analogous dynamics: errors caught before annotation begins cost a fraction of those discovered after review cycles complete. We propose a taxonomy of three QA trigger points, namely pre-annotation (T0), post-annotation (T1), and post-review (T2), that decompose annotation workflows into discrete validation opportunities. A parametric error-propagation model formalizes when timing affects final error rates versus only economics, making timing a measurable design variable rather than a configuration afterthought. A survey of 47 recent papers reveals that only 4% report when validation occurs, a striking gap given timing's demonstrated impact in adjacent fields. Without explicit attention to QA timing, the community risks optimizing validation methods while ignoring the structural variable that may matter most. Acting on this position requires three steps: researchers should report QA timing configurations alongside validation methods; annotation platforms should expose timing as a first-class parameter; and the community should run controlled experiments that measure stage-specific detection rates directly.

摘要:這份立場文件主張,機器學習社群應該優先考慮標註流程中的早期質量保證,而非當前普遍的晚期驗證做法。數據質量瓶頸日益限制基礎模型的改進,然而質量保證研究幾乎專注於驗證方法,而非驗證時機。當驗證發生時,不僅僅是使用了什麼方法,根本上決定了錯誤率和標註成本。考慮到軟體工程中已確立的「向左移動」原則,這種時間上的忽視令人困惑,因為實證研究顯示,在後期階段檢測到的缺陷成本增加了4到100倍(Boehm, 1981; Shull et al., 2002)。標註流程表現出類似的動態:在標註開始之前捕捉到的錯誤成本僅為在審核循環完成後發現的錯誤的一小部分。我們提出了一個三個質量保證觸發點的分類法,即標註前(T0)、標註後(T1)和審核後(T2),將標註工作流程分解為離散的驗證機會。一個參數化的錯誤傳播模型形式化了何時時機影響最終錯誤率與僅影響經濟學,使得時機成為一個可測量的設計變數,而不是事後考慮的配置。一項對47篇近期論文的調查顯示,只有4%報告了驗證發生的時機,考慮到時機在相鄰領域的顯著影響,這是一個驚人的差距。如果不明確關注質量保證的時機,社群將面臨優化驗證方法的風險,同時忽視了可能最重要的結構變數。採取這一立場需要三個步驟:研究人員應該報告質量保證的時機配置以及驗證方法;標註平台應該將時機作為一個一級參數暴露出來;社群應該進行控制實驗,直接測量特定階段的檢測率。
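
The claim that timing can change economics without changing the final error rate is easy to see in a toy propagation model. The sketch below is an assumption-laden illustration, not the paper's model: it treats T0, T1, and T2 as sequential filters with independent detection rates and a per-error fix cost that grows at later stages. All numbers are invented.

```python
def propagate(initial_errors, detection_rates, fix_costs):
    """Pass errors through sequential QA stages (T0, T1, T2).

    detection_rates[i]: fraction of the remaining errors caught at stage i.
    fix_costs[i]: cost of fixing one error caught at stage i.
    Returns (errors that survive every stage, total fixing cost).
    """
    remaining = float(initial_errors)
    total_cost = 0.0
    for rate, cost in zip(detection_rates, fix_costs):
        caught = remaining * rate
        total_cost += caught * cost
        remaining -= caught
    return remaining, total_cost


# Hypothetical costs: fixing an error is 10x pricier at each later stage.
costs = [1.0, 10.0, 100.0]

# Same three detection rates, assigned to different stages.
early_heavy = propagate(100, [0.8, 0.5, 0.5], costs)
late_heavy = propagate(100, [0.5, 0.5, 0.8], costs)
print("early-heavy QA:", early_heavy)
print("late-heavy QA: ", late_heavy)
```

Under these assumptions the residual error count depends only on the product of the miss rates, so reordering the stages leaves it unchanged while the cost differs severalfold. Timing would affect the final error rate as well once detection rates themselves depend on the stage, which is the case the abstract says must be measured directly.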

Feedback World Model Enables Precise Guidance of Diffusion Policy

2605.15705v1 by Tuo An, Jindou Jia, Gen Li, Jingliang Li, Chuhao Zhou, Pengfei Liu, Bofan Lyu, Jiaqi Bai, Xinying Guo, Geng Li, Jianfei Yang

World models aim to improve robotic decision making by predicting the consequences of actions. However, in practice, their predictions often become unreliable once the robot encounters states outside the training distribution, limiting their effectiveness at deployment. We observe that execution itself provides a natural but underutilized signal: after each action, the robot directly observes the true next state, revealing the mismatch between predicted and actual outcomes. Building on this insight, we propose the feedback world model, a new paradigm that closes the loop between prediction and observation at inference time. Instead of treating the world model as a static open-loop predictor, our method maintains a lightweight feedback state that is updated online to iteratively correct future predictions, compensating for model errors using real-time observations without additional training data or parameter updates. We show that this process can be interpreted as a latent-space observer and admits convergence guarantees under mild conditions. We further introduce action-aware guidance to better translate corrected predictions into control by emphasizing action-controllable components while suppressing irrelevant variations. Experiments on LIBERO-Plus, Robomimic, and real-world manipulation tasks demonstrate that our method substantially improves both prediction accuracy and policy performance under distribution shift. In particular, it reduces world model prediction error by up to 76.4% and improves out-of-distribution (OOD) success rate by 30%. These results show that incorporating real-time feedback at inference time provides a simple yet powerful alternative to static world modeling.

摘要:世界模型旨在通過預測行動的後果來改善機器人的決策能力。然而,在實踐中,當機器人遇到訓練分佈之外的狀態時,它們的預測往往變得不可靠,限制了它們在部署時的有效性。我們觀察到執行本身提供了一個自然但未充分利用的信號:在每次行動後,機器人直接觀察到真實的下一狀態,揭示了預測結果與實際結果之間的差異。基於這一見解,我們提出了反饋世界模型,一種在推理時閉合預測與觀察之間循環的新範式。我們的方法不將世界模型視為靜態的開環預測器,而是維持一個輕量級的反饋狀態,該狀態在線更新以迭代地修正未來的預測,利用實時觀察來補償模型錯誤,而不需要額外的訓練數據或參數更新。我們展示了這一過程可以被解釋為潛在空間觀察者,並在溫和條件下承認收斂保證。我們進一步引入了行動感知指導,以更好地將修正的預測轉化為控制,通過強調可控行動組件,同時抑制無關變化。在 LIBERO-Plus、Robomimic 和現實世界操作任務上的實驗表明,我們的方法在分佈轉移下顯著提高了預測準確性和策略性能。特別是,它將世界模型預測誤差降低了高達 76.4%,並提高了分佈外(OOD)成功率 30%。這些結果表明,在推理時引入實時反饋提供了一種簡單而強大的靜態世界建模替代方案。
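
The idea of a feedback state that is corrected online from observed prediction errors can be shown on a scalar toy system. This is an observer-style sketch under strong assumptions (a one-dimensional state, a frozen model that is off by a constant offset, and a fixed gain); it is not the paper's latent-space method, and every function and constant here is hypothetical.

```python
import math


def model_step(x, u):
    """Frozen world model: it never sees the offset present at deployment."""
    return 0.9 * x + 0.5 * u


def true_step(x, u):
    """Deployment dynamics with a constant shift the model was not trained on."""
    return 0.9 * x + 0.5 * u + 0.4


def run(steps=40, gain=0.5):
    x = 0.0
    feedback = 0.0                     # lightweight state updated online
    open_loop, corrected = [], []
    for t in range(steps):
        u = math.sin(0.3 * t)          # arbitrary action sequence
        raw = model_step(x, u)
        prediction = raw + feedback    # corrected prediction, no weight updates
        x_next = true_step(x, u)       # the robot observes the real outcome
        open_loop.append(abs(x_next - raw))
        corrected.append(abs(x_next - prediction))
        # Move the feedback state toward the observed residual.
        feedback += gain * (x_next - prediction)
        x = x_next
    return open_loop, corrected


open_loop_err, corrected_err = run()
print(f"open-loop error at last step: {open_loop_err[-1]:.3f}")
print(f"corrected error at last step: {corrected_err[-1]:.2e}")
```

In this toy case the corrected error shrinks by a factor of (1 - gain) each step, so it converges for any gain between 0 and 2, while the open-loop error stays at the size of the unmodeled offset.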

H-Mem: A Novel Memory Mechanism for Evolving and Retrieving Agent Memory via a Hybrid Structure

2605.15701v1 by Jiawei Yu, Yixiang Fang, Xilin Liu, Yuchi Ma

Memory data are ubiquitous in Large Language Model (LLM)-based agents (e.g., OpenClaw and Manus). A few recent works have attempted to exploit agents' memory to improve their performance on the question-answering (QA) task, but they lack a principled mechanism for modeling how memory data evolves over time and for retrieving it effectively, which leads to poor memory utilization. To fill this gap, we present H-Mem, a novel memory mechanism built on a hybrid structure that not only models the evolution of agent memory over a long period of time, but also provides an efficient memory retrieval approach. Specifically, H-Mem builds a temporal and semantic tree structure that allows short-term memory data to evolve progressively into long-term memory data, where the latter provides summarized information about the former, while simultaneously constructing a knowledge graph to capture the relationships between entities in memory. Moreover, it offers an effective memory retrieval approach that exploits the hybrid tree and graph structure. Extensive experiments on three agent memory benchmarks show that H-Mem achieves state-of-the-art performance on the QA task.

摘要:記憶數據在基於大型語言模型(LLM)的代理中無處不在(例如,OpenClaw 和 Manus)。一些最近的研究嘗試利用代理的記憶來提高其在問答(QA)任務上的表現,但它們缺乏一種原則性機制來有效建模記憶數據隨時間的演變以及有效檢索記憶數據,導致記憶利用率低下。為了填補這一空白,我們提出了 H-Mem,一種新穎的記憶機制,通過混合結構不僅能有效建模代理記憶在長時間內的演變,還能提供高效的記憶檢索方法。特別是,H-Mem 建立了一個時間和語義樹結構,使短期記憶數據能逐步演變為長期記憶數據,後者提供了有關前者的摘要信息,同時構建了一個知識圖譜以捕捉記憶中實體之間的關係。此外,它通過利用樹和圖結構的混合結構提供了一種有效的記憶檢索方法。在三個代理記憶基準上的廣泛實驗顯示,H-Mem 在 QA 任務上達到了最先進的性能。

ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models

2605.15687v1 by Jiahui Guang, Yingjie Zhu, Cuiyun Gao, Haiyan Wang, Jing Li, Di Shao, Zhaoquan Gu

Multimodal large language models (MLLMs) may memorize sensitive cross-modal information during pretraining, making machine unlearning (MU) crucial. Existing methods typically evaluate unlearning effectiveness based on output deviations, while overlooking the generation quality after unlearning. This can easily lead to hallucinated or rigid responses, thereby affecting the usability and safety of the unlearned model. To address this issue, we propose ASRU, a controllable multimodal unlearning framework that incorporates generation quality as a core evaluation objective. ASRU first induces initial refusal behavior through activation redirection, and then optimizes fine-grained refusal boundaries using a customized reward function, thereby achieving a better trade-off between target knowledge unlearning and model utility. Experiments on Qwen3-VL show that, on average, ASRU improves unlearning effectiveness by 24.6% and generation quality by 5.8x while effectively preserving model utility, using only a small amount of retained supervision data.

摘要:多模態大型語言模型(MLLMs)在預訓練期間可能會記住敏感的跨模態信息,使得機器遺忘(MU)變得至關重要。現有的方法通常基於輸出偏差來評估遺忘的有效性,而忽略了遺忘後的生成質量。這很容易導致產生幻覺或僵化的回應,從而影響未學習模型的可用性和安全性。為了解決這個問題,我們提出了ASRU,一個可控的多模態遺忘框架,將生成質量作為核心評估目標。ASRU首先通過激活重定向誘導初始拒絕行為,然後使用自定義獎勵函數優化細粒度的拒絕邊界,從而在目標知識遺忘和模型效用之間實現更好的權衡。在Qwen3-VL上的實驗顯示,ASRU顯著提高了遺忘的有效性(平均+24.6%)和生成質量(平均5.8倍),同時有效保留模型效用,僅使用少量保留的監督數據。

Few-Shot Large Language Models for Actionable Triage Categorization of Online Patient Inquiries

2605.15680v1 by Liqi Zhou, Jiafu Li

Online patient inquiries are often informal, incomplete, and written before professional assessment, yet they must still be routed to an appropriate level of clinical follow-up. We study this as a four-class actionable triage task (self-care, schedule-visit, urgent-clinician-review, or emergency-referral) and ask whether prompted large language models (LLMs) can support such routing under low-resource labeling conditions. Using the public HealthCareMagic-100K corpus, we construct a 300-example human-calibrated gold evaluation set, a 700-example auto-labeled silver training set, and a 40-example few-shot pool. We compare Term Frequency-Inverse Document Frequency (TF-IDF) and Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) baselines trained on silver labels against six prompted LLMs under 0-shot, 4-shot, and 12-shot conditions. We evaluate with macro-$F_1$ alongside safety-aware metrics, including emergency recall, under-triage rate, and severe under-triage rate. The strongest LLM (Claude Haiku 4.5, 12-shot) reaches a macro-$F_1$ of 0.475, exceeding the best supervised baseline (BioBERT, 0.378) on point estimate, though the confidence intervals overlap. Few-shot prompting and two-model agreement help in label-dependent ways: self-care agreement is reliable, urgent-clinician-review agreement is not. We conclude that LLMs can support triage prioritization and selective human review, but not autonomous deployment.

摘要:線上病患詢問通常是非正式的、不完整的,並且在專業評估之前撰寫,然而它們仍然必須被引導至適當的臨床後續跟進層級。我們將這個問題研究為一個四類可行的分診任務——自我照護、預約就診、緊急醫生審查或緊急轉診,並詢問在低資源標記條件下,提示的大型語言模型(LLMs)是否能支持這種引導。使用公共的 HealthCareMagic-100K 語料庫,我們構建了一個 300 範例的人類校準金標準評估集、一個 700 範例的自動標記銀標準訓練集,以及一個 40 範例的少量樣本池。我們比較了基於銀標籤訓練的詞頻-逆文檔頻率(TF-IDF)和生物醫學文本挖掘的雙向編碼器表示(BioBERT)基準,與六個在 0-shot、4-shot 和 12-shot 條件下的提示 LLM 進行比較。因此,我們使用宏觀-$F_1$ 以及安全意識指標進行評估,包括緊急召回、低分診率和嚴重低分診率。最強的 LLM(Claude Haiku 4.5,12-shot)達到宏觀-$F_1$ 0.475,超過最佳的監督基準(BioBERT,0.378)在點估計上,且置信區間重疊。少量樣本提示和兩模型一致性在標籤依賴的方式上有所幫助:自我照護的一致性是可靠的,緊急醫生審查則不是。我們得出結論,LLMs 可以支持分診優先排序和選擇性的人類審查,但不能進行自主部署。
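
The metrics named in this abstract can be computed as in the sketch below. It is a minimal illustration, not the paper's evaluation code. The severity ordering of the four labels follows the order in which the abstract lists them. The definition of "severe" under-triage as a prediction two or more levels below the gold label is an assumption made here for illustration, and the toy `gold` and `pred` lists are invented.

```python
# Labels ordered from least to most urgent (order taken from the abstract).
LEVELS = ["self-care", "schedule-visit", "urgent-clinician-review", "emergency-referral"]
RANK = {name: i for i, name in enumerate(LEVELS)}

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all four classes."""
    scores = []
    for label in LEVELS:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(LEVELS)

def emergency_recall(gold, pred):
    """Share of true emergency cases that were predicted as emergencies."""
    idx = [i for i, g in enumerate(gold) if g == "emergency-referral"]
    if not idx:
        return 0.0
    return sum(pred[i] == "emergency-referral" for i in idx) / len(idx)

def under_triage_rate(gold, pred, min_gap=1):
    """Share of cases predicted at least `min_gap` levels below the gold label.

    min_gap=2 is used here as a stand-in for 'severe' under-triage (assumption).
    """
    misses = sum(RANK[g] - RANK[p] >= min_gap for g, p in zip(gold, pred))
    return misses / len(gold)

# Invented toy data, one example per gold class.
gold = ["self-care", "schedule-visit", "urgent-clinician-review", "emergency-referral"]
pred = ["self-care", "self-care", "urgent-clinician-review", "schedule-visit"]
```

Reporting under-triage separately from macro-$F_1$ reflects the asymmetry in this task: routing an emergency to a scheduled visit is a worse error than the reverse, and an averaged score alone would hide it.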

VCG-Bench: Towards A Unified Visual-Centric Benchmark for Structured Generation and Editing

2605.15677v1 by Xiaoyan Su, Peijie Dong, Zhenheng Tang, Song Tang, Yuyao Zhai, Kaitao Lin, Liang Chen, Gai Yuhang, Yuyu Luo, Qiang Wang, Xiaowen Chu

Despite the rapid advancements in Vision-Language Models (VLMs), a critical gap remains in their ability to handle structured, controllable diagrammatic tasks essential for professional workflows. Existing methods predominantly rely on pixel-based synthesis, which operates in probabilistic pixel spaces and is inherently limited in editability and fidelity. Instead, we propose a new Diagram-as-Code paradigm with symbolic logic that leverages mxGraph Extensible Markup Language (XML) for precise diagram generation and editing. We present VCG-Bench, a unified benchmark for visual-centric mxGraph tasks. VCG-Bench comprises: (1) a taxonomized dataset of 1,449 diverse diagrams spanning 6 domains and 15 sub-domains, (2) a paradigm definition that integrates Generation (Vision-to-Code) and Editability (Code-to-Code), and (3) a Tailored Evaluation Protocol employing multi-dimensional metrics such as mxGraph Execution Success Rate and Style Consistency Score (SCS). Experimental results highlight the challenges faced by current State-of-the-Art (SOTA) VLMs in structured fidelity and instruction compliance, reflecting their vision and reasoning capabilities.

摘要:儘管視覺-語言模型(VLMs)迅速進步,但在處理對專業工作流程至關重要的結構化、可控圖示任務方面仍存在一個關鍵的空白。現有的方法主要依賴基於像素的合成,這在概率像素空間中運作,並在可編輯性和真實性上固有地有限。相反,我們提出了一種新的圖示即代碼(Diagram-as-Code)範式,利用符號邏輯,並利用 mxGraph 可擴展標記語言(XML)進行精確的圖示生成和編輯。我們呈現了 VCG-Bench,一個統一的視覺中心 \texttt{mxGraph} 任務基準。VCG-Bench 包含:(1)一個經過分類的數據集,包含 1,449 個多樣化的圖示,涵蓋 6 個領域和 15 個子領域;(2)一個範式定義,整合了生成(Vision-to-Code)和可編輯性(Code-to-Code);(3)一個量身定制的評估協議,採用多維度指標,如 \texttt{mxGraph} 執行成功率、風格一致性分數(SCS)等。實驗結果突顯了當前最先進(SOTA)VLMs 在結構化真實性和指令遵循方面面臨的挑戰,反映了它們的視覺和推理能力。

Dynamic Chunking for Diffusion Language Models

2605.15676v1 by Yichen Zhu, Xiaoming Shi, Peng Zhao, Weiyu Chen, Debing Zhang, James Kwok

Block discrete diffusion language models factorize a sequence autoregressively over fixed-size positional blocks, decoupling within-block parallel denoising from across-block conditioning. We argue that this rigid partition wastes structure already present in the sequence: blocks defined by position rather than by content separate semantically coherent tokens and group unrelated ones together. We introduce the Dynamic Chunking Diffusion Model (DCDM), which replaces positional blocks with content-defined semantic chunks. At its core is Chunking Attention, a differentiable layer that routes tokens into $K$ clusters parameterized by learnable subspaces and shaped end-to-end by the diffusion objective. The resulting cluster assignments induce a chunk-causal attention mask under which a discrete diffusion denoiser factorizes the sequence likelihood autoregressively over semantic chunks, strictly generalizing block discrete diffusion. On downstream benchmarks at parameter scales up to 1.5B, DCDM consistently improves over both unstructured and positional-block diffusion baselines, with the advantage stable across scales and visible early in training.

摘要:區塊離散擴散語言模型自回歸地在固定大小的位置信息區塊上對序列進行因式分解,將區塊內的並行去噪與區塊間的條件解耦。我們認為這種僵化的劃分浪費了序列中已存在的結構:由位置而非內容定義的區塊將語義上相干的標記分開,並將無關的標記聚集在一起。我們引入了\textbf{D}ynamic \textbf{C}hunking \textbf{D}iffusion \textbf{M}odel (DCDM),它用內容定義的語義塊替代了位置區塊。其核心是Chunking Attention,一個可微分的層,將標記路由到由可學習子空間參數化的$K$個集群中,並通過擴散目標端到端地塑造。生成的集群分配在其下引入了一個塊因果注意力掩碼,根據該掩碼,離散擴散去噪器自回歸地在語義塊上對序列的似然進行因式分解,嚴格地推廣了區塊離散擴散。在參數規模高達1.5B的下游基準測試中,DCDM在無結構和位置區塊擴散基準上持續改進,這一優勢在不同規模間穩定,並在訓練早期即可見。
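
The chunk-causal attention mask mentioned in this abstract can be sketched as follows. This is an illustrative reading, not the paper's implementation: it assumes each token already has a hard integer chunk id and that chunk ids are ordered by generation order. The differentiable routing performed by Chunking Attention is not reproduced, and the function name `chunk_causal_mask` is hypothetical.

```python
import numpy as np

def chunk_causal_mask(chunk_ids):
    """Boolean mask where mask[i, j] is True if token i may attend to token j.

    A token sees every token in its own chunk (parallel denoising within
    the chunk) and every token in earlier chunks (autoregressive conditioning).
    """
    ids = np.asarray(chunk_ids)
    return ids[None, :] <= ids[:, None]

# Positional blocks of size 2 are the special case chunk_id = position // 2.
positional = np.arange(6) // 2            # [0, 0, 1, 1, 2, 2]
block_mask = chunk_causal_mask(positional)

# Content-defined chunks need not be contiguous in position (invented ids).
semantic = np.array([0, 1, 0, 1, 2, 2])
semantic_mask = chunk_causal_mask(semantic)
```

Because the positional case is obtained by one particular choice of chunk ids, the same masking rule covers both block diffusion and content-defined chunks, which is consistent with the abstract's claim that the approach generalizes block discrete diffusion.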

Interaction-Aware Influence Functions for Group Attribution

2605.15675v1 by Jaeseung Heo, Kyeongheung Yun, Youngbin Choi, Sehyun Hwang, Jungseul Ok, Dongwoo Kim

Influence functions approximate how removing a training example changes a quantity of interest, called the target function, such as a held-out loss. To estimate the influence of a group of examples, the standard practice is to sum the individual influences of its members. However, this sum does not capture how examples jointly affect the target: a pair of examples may be redundant or complementary, but the sum cannot distinguish these cases. We propose an interaction-aware influence function that characterizes how interactions between examples influence the target. By expanding the target to second order around the trained parameters, we obtain an estimator that augments the standard sum with a pairwise interaction term that captures the alignment between two examples' effects on the target. We empirically evaluate our estimator in two settings. First, on six dataset-model pairs spanning logistic regression, MLPs, and ResNet-9, our estimator tracks leave-group-out retraining substantially better than first-order influence across all settings. Second, when used as a greedy selection rule for instruction-tuning data on Llama-3.1-8B, it beats prior influence-based and representation-similarity baselines on five of seven downstream tasks, in a regime where standard influence-based selection underperforms random selection.

摘要:影響函數近似於移除一個訓練範例如何改變一個感興趣的量,稱為目標函數,例如保留的損失。要估計一組範例的影響,標準做法是將其成員的個別影響相加。然而,這個總和並不能捕捉範例如何共同影響目標:一對範例可能是冗餘的或互補的,但總和無法區分這些情況。我們提出了一種考慮互動的影響函數,描述範例之間的互動如何影響目標。通過在訓練參數周圍將目標擴展到二階,我們獲得了一個估計量,該估計量用一個成對互動項增強了標準總和,捕捉了兩個範例對目標影響的對齊。我們在兩個設置中對我們的估計量進行了實證評估。首先,在六個數據集-模型對上,涵蓋邏輯回歸、MLP 和 ResNet-9,我們的估計量在所有設置中都顯著優於一階影響,能更好地跟蹤離群重訓。其次,當用作 Llama-3.1-8B 上指令調整數據的貪婪選擇規則時,它在七個下游任務中的五個任務上超越了先前基於影響和表示相似性的基準,在標準基於影響的選擇表現不如隨機選擇的情況下。
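
The second-order expansion described in this abstract can be checked numerically on a toy problem. The sketch below is not the authors' estimator. It assumes a quadratic target function, takes the per-example parameter shifts `shifts` as given (in practice these would come from an influence-style approximation), and assumes the group shift is the sum of the individual shifts. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, group = 4, 3

M = rng.normal(size=(dim, dim))
H = M @ M.T                        # symmetric Hessian of the toy target
g = rng.normal(size=dim)           # target gradient at the trained parameters
theta = rng.normal(size=dim)       # trained parameters
c = g - H @ theta                  # linear term chosen so the gradient at theta equals g

def target(p):
    """Toy quadratic target, e.g. a stand-in for a held-out loss."""
    return 0.5 * p @ H @ p + c @ p

# Hypothetical parameter shift caused by removing each example.
shifts = rng.normal(scale=0.1, size=(group, dim))

# Standard group estimate: sum of first-order individual influences.
first_order = sum(g @ d for d in shifts)
# Second-order terms: each example with itself, then each pair of examples.
self_terms = sum(0.5 * d @ H @ d for d in shifts)
pairwise = sum(shifts[i] @ H @ shifts[j]
               for i in range(group) for j in range(i + 1, group))

exact = target(theta + shifts.sum(axis=0)) - target(theta)
```

For a quadratic target the expansion is exact, so `first_order + self_terms + pairwise` matches `exact`, while `first_order` alone does not. The sign of each pairwise term `shifts[i] @ H @ shifts[j]` depends on whether the two shifts are aligned or opposed under `H`, which is the kind of joint effect a plain sum of individual influences cannot represent.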

VLMs Trace Without Tracking: Diagnosing Failures in Visual Path Following

2605.15672v1 by Hyesoo Hong, Minsoo Kim, Wonje Jeung, Sangyeon Yoon, Dongjae Jeon, Albert No

Vision-language models (VLMs) achieve strong performance on multimodal benchmarks, but may still lack robust control over basic visual operations. We study line tracing, where a model must follow a selected visual path through successive local continuations. To isolate this ability, we design controlled tracing tasks that introduce nearby competitors while reducing semantic and topological ambiguity such as crossings and overlaps. Across these tasks, even state-of-the-art VLMs frequently lose the target path and switch to nearby alternatives, especially when those alternatives look locally similar to the target. Behavioral interventions and internal analyses indicate that these failures arise from local competition: nearby similar distractors pull the model away from the true continuation. Standard remedies do not remove this bottleneck: model-size scaling provides only limited gains, reasoning partially compensates through costly substitute strategies, and explicit tracing instructions fail to recover stable path following. Finally, tests on tangled-cable scenes and metro maps with richer visual complexity show that the same path-switching failure persists beyond our controlled settings.

摘要:視覺語言模型(VLMs)在多模態基準上表現出色,但仍可能缺乏對基本視覺操作的穩健控制。我們研究\textit{線條追蹤},其中模型必須通過連續的局部延續來跟隨選定的視覺路徑。為了孤立這種能力,我們設計了控制追蹤任務,這些任務引入了附近的競爭者,同時減少語義和拓撲的模糊性,例如交叉和重疊。在這些任務中,即使是最先進的VLMs也經常偏離目標路徑,轉而選擇附近的替代路徑,特別是當這些替代路徑在局部上看起來與目標相似時。行為干預和內部分析表明,這些失敗源於局部競爭:附近相似的干擾物使模型偏離真實的延續。標準的補救措施並未消除這一瓶頸:模型規模擴展僅提供有限的增益,推理部分通過代價高昂的替代策略進行補償,而明確的追蹤指令無法恢復穩定的路徑跟隨。最後,在纏繞電纜場景和具有更豐富視覺複雜性的地鐵地圖上的測試顯示,這種路徑切換失敗在我們的控制環境之外仍然持續存在。

PRISM: Prompt Reliability via Iterative Simulation and Monitoring for Enterprise Conversational AI

2605.15665v1 by Keshava Chaitanya, Jahnavi Gundakaram

Deploying large language model (LLM)-driven conversational agents in enterprise settings requires prompts that are simultaneously correct at launch and resilient to the non-deterministic behavioral drift that characterizes production LLM deployments. Existing prompt optimization frameworks address prompt quality as a one-time compile-time problem, leaving open the equally critical question of how to detect and repair prompt regressions caused by silent LLM behavior changes over time. We present PRISM (Prompt Reliability via Iterative Simulation and Monitoring), a closed-loop framework that treats prompt engineering as a continuous reliability engineering problem rather than a one-time authorship task. PRISM takes as input plain-language agent requirements, a set of configured tools and memory variables, and an initial draft prompt. It automatically generates test cases from requirements, simulates full multi-turn conversations against a platform-faithful LLM environment, evaluates pass/fail using an LLM-as-judge, diagnoses root causes of failures, and surgically repairs the prompt -- iterating until all tests pass. Critically, PRISM is designed to run on a scheduled basis (daily), treating LLM behavioral drift as a first-class reliability concern. We evaluate PRISM across 35 enterprise conversational agents over a three-week deployment period on the Yellow.ai V3 platform. PRISM reduces median prompt authoring time from 2 days to under 30 minutes, achieves 99% production reliability across all evaluated agents, and successfully identifies and repairs production regressions caused by LLM behavioral drift within a 24-hour detection window. Our results suggest that continuous, simulation-driven prompt optimization is both tractable and necessary for reliable enterprise conversational AI at scale.

摘要:在企業環境中部署大型語言模型(LLM)驅動的對話代理需要同時在啟動時正確且對生產LLM部署中非確定性行為漂移具有韌性的提示。現有的提示優化框架將提示質量視為一次性的編譯時問題,卻未解決同樣關鍵的問題,即如何檢測和修復由於隱性LLM行為變化而引起的提示回歸。我們提出了PRISM(通過迭代模擬和監控的提示可靠性),這是一個封閉循環框架,將提示工程視為一個持續的可靠性工程問題,而非一次性的創作任務。PRISM以普通語言的代理需求、一組配置的工具和記憶變數以及初始草稿提示作為輸入。它自動從需求生成測試案例,模擬針對平台忠實LLM環境的完整多輪對話,使用LLM作為評判進行通過/失敗評估,診斷失敗的根本原因,並精確修復提示——不斷迭代直到所有測試通過。關鍵是,PRISM被設計為定期運行(每日),將LLM行為漂移視為一個一級的可靠性問題。我們在Yellow.ai V3平台上對35個企業對話代理進行了為期三週的部署期間評估PRISM。PRISM將中位數提示創作時間從2天減少到30分鐘以內,實現了所有評估代理的99%生產可靠性,並成功識別和修復了在24小時檢測窗口內由LLM行為漂移引起的生產回歸。我們的結果表明,持續的、基於模擬的提示優化在大規模可靠企業對話AI中是可行且必要的。
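
The test, simulate, judge, diagnose, and repair cycle described in this abstract can be sketched as a small control loop. This is a structural illustration only, not PRISM's implementation. Every function name below is hypothetical, and the stand-ins for simulation and judging are plain string checks instead of LLM calls.

```python
def repair_loop(prompt, tests, simulate, judge, diagnose, repair, max_rounds=5):
    """Iterate until every test passes or the round budget is spent."""
    for _ in range(max_rounds):
        failures = []
        for test in tests:
            transcript = simulate(prompt, test)
            if not judge(test, transcript):
                failures.append((test, transcript))
        if not failures:
            return prompt, True
        prompt = repair(prompt, diagnose(failures))
    return prompt, False

# Toy stand-ins (assumptions): a test is a behavior the prompt must mention.
def simulate(prompt, test):
    return prompt                      # stand-in for a multi-turn simulation

def judge(test, transcript):
    return test in transcript          # stand-in for an LLM-as-judge verdict

def diagnose(failures):
    return [test for test, _ in failures]

def repair(prompt, causes):
    return prompt + " " + " ".join(f"Always {cause}." for cause in causes)

tests = ["greet the user", "confirm the order id"]
final_prompt, passed = repair_loop(
    "You are a support agent.", tests, simulate, judge, diagnose, repair
)
```

Running the same loop on a schedule against a live model, as the abstract describes for daily runs, would turn it from an authoring aid into a regression monitor. The `max_rounds` cap is an addition here so that the loop terminates even when a repair never succeeds.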

VAGS: Velocity Adaptive Guidance Scale for Image Editing and Generation

2605.15661v1 by Yan Luo, Ahmadou Aidara, Jingyi Lu, Jeremy Moebel, Kai Han, Mengyu Wang

Classifier-free guidance (CFG) is the primary control over how strongly text semantics move a flow-based sampler, yet standard practice holds its scale fixed across the entire ODE trajectory. This is a fundamental mismatch: early steps are noise-dominated and carry weak semantic signal, while late steps commit image structure and demand stronger directional commitment; more critically, the value of any guidance strength depends on whether the guided velocity is consistent with the model's current dynamics or working against them. We propose Velocity-Adaptive Guidance Scale (VAGS), a training-free replacement that multiplies the nominal scale by a bounded factor combining a temporal signal-level term with the cosine similarity between task-relevant velocity fields. For inversion-free editing, VAGS measures the alignment between source- and target-guided velocities, so edit strength at each step reflects local compatibility between preservation and transformation. For generation, VAGS-Gen uses the alignment between unconditional and conditional velocities as the analogous signal. Neither variant requires fine-tuning, auxiliary networks, or extra forward passes, and fixed CFG is recovered as a special case. On PIE-Bench and DIV2K for editing, and COCO17, CUB-200, and Flickr30K for generation, VAGS consistently improves structural fidelity and generation quality over fixed CFG and recent training-free guidance variants. The code is publicly available at https://github.com/Harvard-AI-and-Robotics-Lab/Velocity_Adaptive_Guidance_Scale.

摘要:無分類器引導(CFG)是控制文本語義如何強烈影響基於流的取樣器的主要手段,但標準做法是在整個ODE軌跡中保持其比例不變。這是一個根本的不匹配:早期步驟受到噪聲主導,並攜帶弱語義信號,而晚期步驟則承擔圖像結構並要求更強的方向性承諾;更關鍵的是,任何引導強度的價值取決於引導速度是否與模型當前的動態一致或相對抗。我們提出了\textit{速度自適應引導比例}(VAGS),這是一種無需訓練的替代方案,通過結合時間信號級別項與任務相關速度場之間的餘弦相似度來將名義比例乘以一個有界因子。對於無反演編輯,VAGS測量源引導和目標引導速度之間的對齊,因此每一步的編輯強度反映了保留與轉換之間的局部兼容性。對於生成,VAGS-Gen使用無條件和有條件速度之間的對齊作為類似信號。這兩種變體都不需要微調、輔助網絡或額外的前向傳遞,固定的CFG作為特例被恢復。在編輯方面的PIE-Bench和DIV2K,以及生成方面的COCO17、CUB-200和Flickr30K上,VAGS在結構保真度和生成質量上始終優於固定CFG和最近的無訓練引導變體。代碼可在https://github.com/Harvard-AI-and-Robotics-Lab/Velocity_Adaptive_Guidance_Scale公開獲得。
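
The scale adaptation described in this abstract can be sketched as below. The abstract states only that the nominal scale is multiplied by a bounded factor that combines a temporal signal-level term with a cosine similarity between velocity fields. The specific form `1 + strength * signal_level * cosine`, the clipping bounds, and all function and parameter names are assumptions made for illustration and are not taken from the paper or its repository. The `guided_velocity` function is the standard classifier-free guidance combination.

```python
import numpy as np

def cosine(u, v, eps=1e-8):
    """Cosine similarity between two velocity fields, flattened."""
    u, v = np.ravel(u), np.ravel(v)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def adaptive_scale(nominal, v_a, v_b, signal_level,
                   strength=0.5, lo=0.5, hi=1.5):
    """Nominal scale times a bounded factor (hypothetical functional form).

    signal_level is assumed to lie in [0, 1]: near 0 at noise-dominated
    steps and near 1 at late steps. strength=0 recovers fixed CFG.
    """
    factor = 1.0 + strength * signal_level * cosine(v_a, v_b)
    return nominal * float(np.clip(factor, lo, hi))

def guided_velocity(v_uncond, v_cond, scale):
    """Standard classifier-free guidance combination of two velocities."""
    return v_uncond + scale * (v_cond - v_uncond)

# Invented toy velocities: aligned, orthogonal, and opposed pairs.
v = np.array([1.0, 2.0])
aligned = adaptive_scale(7.5, v, v, signal_level=0.8)
orthogonal = adaptive_scale(7.5, np.array([1.0, 0.0]), np.array([0.0, 1.0]), 0.8)
opposed = adaptive_scale(7.5, v, -v, signal_level=0.8)
fixed = adaptive_scale(7.5, v, -v, signal_level=0.8, strength=0.0)
```

Because both inputs to the factor are velocities the sampler already computes at each step, a rule of this shape adds no forward passes, which matches the abstract's description of a training-free replacement.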