arxiv-daily

Automated deployment @ 2026-04-05 20:50:21 Asia/Taipei

Welcome to contribute! Add your topics and keywords in topic.yml. You can also view historical data through the storage.

AI

Knowledge Graphs

| Publish Date | Title | Authors | Homepage | Code |
|---|---|---|---|---|
| 2026-04-02 | Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation | Daiwei Chen et al. | 2604.02324v1 | null |
| 2026-04-02 | Crystalite: A Lightweight Transformer for Efficient Crystal Modeling | Tin Hadži Veljković et al. | 2604.02270v1 | null |
| 2026-04-02 | Retrieval-Augmented Question Answering over Scientific Literature for the Electron-Ion Collider | Tina. J. Jat et al. | 2604.02259v1 | null |
| 2026-04-02 | When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning | Juarez Monteiro et al. | 2604.02226v1 | null |
| 2026-04-02 | Universal Hypernetworks for Arbitrary Models | Xuanfeng Zhou et al. | 2604.02215v1 | null |
| 2026-04-02 | LEO: Graph Attention Network based Hybrid Multi Sensor Extended Object Fusion and Tracking for Autonomous Driving Applications | Mayank Mayank et al. | 2604.02206v1 | null |
| 2026-04-02 | Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model | Jaemin Kim et al. | 2604.02194v1 | null |
| 2026-04-02 | TRU: Targeted Reverse Update for Efficient Multimodal Recommendation Unlearning | Zhanting Zhou et al. | 2604.02183v1 | null |
| 2026-04-02 | Adam's Law: Textual Frequency Law on Large Language Models | Hongyuan Adam Lu et al. | 2604.02176v1 | null |
| 2026-04-02 | GaelEval: Benchmarking LLM Performance for Scottish Gaelic | Peter Devine et al. | 2604.02135v1 | null |
| 2026-04-02 | Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning | Yuhang Wu et al. | 2604.02091v1 | null |
| 2026-04-02 | Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection | Soo Won Seo et al. | 2604.02071v1 | null |
| 2026-04-02 | Diff-KD: Diffusion-based Knowledge Distillation for Collaborative Perception under Corruptions | Pengcheng Lyu et al. | 2604.02061v1 | null |
| 2026-04-02 | SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning | Daeyong Kwon et al. | 2604.01993v1 | null |
| 2026-04-02 | Do We Need Bigger Models for Science? Task-Aware Retrieval with Small Language Models | Florian Kelber et al. | 2604.01965v1 | null |
| 2026-04-02 | Abnormal Head Movements in Neurological Conditions: A Knowledge-Based Dataset with Application to Cervical Dystonia | Saja Al-Dabet et al. | 2604.01962v1 | null |
| 2026-04-02 | How to measure the optimality of word or gesture order with respect to the principle of swap distance minimization | Ramon Ferrer-i-Cancho et al. | 2604.01938v1 | null |
| 2026-04-02 | CANDI: Curated Test-Time Adaptation for Multivariate Time-Series Anomaly Detection Under Distribution Shift | HyunGi Kim et al. | 2604.01845v1 | null |
| 2026-04-02 | Domain-constrained knowledge representation: A modal framework | Chao Li et al. | 2604.01770v1 | null |
| 2026-04-02 | FSKD: Monocular Forest Structure Inference via LiDAR-to-RGBI Knowledge Distillation | Taimur Khan et al. | 2604.01766v1 | null |
| 2026-04-02 | AeroTherm-GPT: A Verification-Centered LLM Framework for Thermal Protection System Engineering Workflows | Chuhan Qiao et al. | 2604.01738v1 | null |
| 2026-04-02 | The AnIML Ontology: Enabling Semantic Interoperability for Large-Scale Experimental Data in Interconnected Scientific Labs | Wilf Morlidge et al. | 2604.01728v1 | null |
| 2026-04-02 | LiteInception: A Lightweight and Interpretable Deep Learning Framework for General Aviation Fault Diagnosis | Zhihuan Wei et al. | 2604.01725v1 | null |
| 2026-04-02 | Human-Guided Reasoning with Large Language Models for Vietnamese Speech Emotion Recognition | Truc Nguyen et al. | 2604.01711v1 | null |
| 2026-04-02 | Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework | Yanchen Wu et al. | 2604.01707v1 | null |
| 2026-04-02 | MiCA Learns More Knowledge Than LoRA and Full Fine-Tuning | Sten Rüdiger et al. | 2604.01694v1 | null |
| 2026-04-02 | PRISM: Probability Reallocation with In-Span Masking for Knowledge-Sensitive Alignment | Chenning Xu et al. | 2604.01682v1 | null |
| 2026-04-02 | Can Heterogeneous Language Models Be Fused? | Shilian Chen et al. | 2604.01674v1 | null |
| 2026-04-02 | PRCCF: A Persona-guided Retrieval and Causal-aware Cognitive Filtering Framework for Emotional Support Conversation | Yanxin Luo et al. | 2604.01671v1 | null |
| 2026-04-02 | Hierarchical Memory Orchestration for Personalized Persistent Agents | Junming Liu et al. | 2604.01670v1 | null |
| 2026-04-02 | Robust Embodied Perception in Dynamic Environments via Disentangled Weight Fusion | Juncen Guo et al. | 2604.01669v1 | null |
| 2026-04-02 | M3D-BFS: a Multi-stage Dynamic Fusion Strategy for Sample-Adaptive Multi-Modal Brain Network Analysis | Rui Dong et al. | 2604.01667v1 | null |
| 2026-04-02 | CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery | Ao Qu et al. | 2604.01658v1 | null |
| 2026-04-02 | AromaGen: Interactive Generation of Rich Olfactory Experiences with Multimodal Language Models | Yunge Wen et al. | 2604.01650v1 | null |
| 2026-04-02 | Exploring Robust Multi-Agent Workflows for Environmental Data Management | Boyuan Guan et al. | 2604.01647v1 | null |
| 2026-04-02 | CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning | Junyoung Sung et al. | 2604.01634v1 | null |
| 2026-04-02 | GraphWalk: Enabling Reasoning in Large Language Models through Tool-Based Graph Navigation | Taraneh Ghandi et al. | 2604.01610v1 | null |
| 2026-04-02 | From Multi-Agent to Single-Agent: When Is Skill Distillation Beneficial? | Binyan Xu et al. | 2604.01608v1 | null |
| 2026-04-02 | ByteRover: Agent-Native Memory Through LLM-Curated Hierarchical Context | Andy Nguyen et al. | 2604.01599v1 | null |
| 2026-04-02 | Do Large Language Models Mentalize When They Teach? | Sevan K. Harootonian et al. | 2604.01594v1 | null |
| 2026-04-02 | A Role-Based LLM Framework for Structured Information Extraction from Healthy Food Policies | Congjing Zhang et al. | 2604.01529v1 | null |
| 2026-04-01 | A Self-Evolving Agentic Framework for Metasurface Inverse Design | Yi Huang et al. | 2604.01480v1 | null |
| 2026-04-01 | Can LLMs Predict Academic Collaboration? Topology Heuristics vs. LLM-Based Link Prediction on Real Co-authorship Networks | Fan Huang et al. | 2604.01379v1 | null |
| 2026-04-01 | No Attacker Needed: Unintentional Cross-User Contamination in Shared-State LLM Agents | Tiankai Yang et al. | 2604.01350v1 | null |
| 2026-04-01 | Procedural Knowledge at Scale Improves Reasoning | Di Wu et al. | 2604.01348v1 | null |
| 2026-04-01 | IDEA2: Expert-in-the-loop competency question elicitation for collaborative ontology engineering | Elliott Watkiss-Leek et al. | 2604.01344v1 | null |
| 2026-04-01 | Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models | Marco Morini et al. | 2604.01280v1 | null |
| 2026-04-01 | Brainstacks: Cross-Domain Cognitive Capabilities via Frozen MoE-LoRA Stacks for Continual LLM Learning | Mohammad R. Abu Ayyash et al. | 2604.01152v1 | null |
| 2026-04-01 | Looking into a Pixel by Nonlinear Unmixing -- A Generative Approach | Maofeng Tang et al. | 2604.01141v1 | null |
| 2026-04-01 | Lightweight Prompt-Guided CLIP Adaptation for Monocular Depth Estimation | Reyhaneh Ahani Manghotay et al. | 2604.01118v1 | null |
| 2026-04-01 | Revision or Re-Solving? Decomposing Second-Pass Gains in Multi-LLM Pipelines | Jingjie Ning et al. | 2604.01029v1 | null |
| 2026-04-01 | Bridging Structured Knowledge and Data: A Unified Framework with Finance Applications | Yi Cao et al. | 2604.00987v1 | null |
| 2026-04-01 | Experience as a Compass: Multi-agent RAG with Evolving Orchestration and Agent Prompts | Sha Li et al. | 2604.00901v1 | null |
| 2026-04-01 | Transforming OPACs into Intelligent Discovery Systems: An AI-Powered, Knowledge Graph-Driven Smart OPAC for Digital Libraries | M. S. Rajeevan et al. | 2604.01262v1 | null |
| 2026-04-01 | LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation | Patrick Amadeus Irawan et al. | 2604.00829v1 | null |
| 2026-04-01 | From Early Encoding to Late Suppression: Interpreting LLMs on Character Counting Tasks | Ayan Datta et al. | 2604.00778v1 | null |
| 2026-04-01 | BioCOMPASS: Integrating Biomarkers into Transformer-Based Immunotherapy Response Prediction | Sayed Hashim et al. | 2604.00739v1 | null |
| 2026-04-01 | To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining | Karan Singh et al. | 2604.00715v1 | null |
| 2026-04-01 | Internal APIs Are All You Need: Shadow APIs, Shared Discovery, and the Case Against Browser-First Agent Architectures | Lewis Tham et al. | 2604.00694v1 | null |
| 2026-04-01 | A Survey of On-Policy Distillation for Large Language Models | Mingyang Song et al. | 2604.00626v1 | null |
| 2026-04-01 | Speech LLMs are Contextual Reasoning Transcribers | Keqi Deng et al. | 2604.00610v1 | null |
| 2026-04-01 | Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents | Thanh Luong Tuan et al. | 2604.00555v1 | null |
| 2026-04-01 | Optimsyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation | Zhiting Fan et al. | 2604.00536v1 | null |
| 2026-04-01 | Polysemanticity or Polysemy? Lexical Identity Confounds Superposition Metrics | Iyad Ait Hou et al. | 2604.00443v1 | null |
| 2026-04-01 | TR-ICRL: Test-Time Rethinking for In-Context Reinforcement Learning | Wenxuan Jiang et al. | 2604.00438v1 | null |
| 2026-04-01 | COTTA: Context-Aware Transfer Adaptation for Trajectory Prediction in Autonomous Driving | Seohyoung Park et al. | 2604.00402v1 | null |
| 2026-04-01 | RAGShield: Provenance-Verified Defense-in-Depth Against Knowledge Base Poisoning in Government Retrieval-Augmented Generation Systems | KrishnaSaiReddy Patil et al. | 2604.00387v1 | null |
| 2026-04-01 | Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning | Eric Hanchen Jiang et al. | 2604.00344v1 | null |
| 2026-03-31 | Improvisational Games as a Benchmark for Social Intelligence of AI Agents: The Case of Connections | Gaurav Rajesh Parikh et al. | 2604.00284v1 | null |
| 2026-03-31 | A Study on the Impact of Fault localization Granularity for Repository-Scale Code Repair Tasks | Joseph Townsend et al. | 2604.00167v1 | null |
| 2026-03-31 | Epileptic Seizure Detection in Separate Frequency Bands Using Feature Analysis and Graph Convolutional Neural Network (GCN) from Electroencephalogram (EEG) Signals | Ferdaus Anam Jibon et al. | 2604.00163v1 | null |
| 2026-03-31 | From Domain Understanding to Design Readiness: a playbook for GenAI-supported learning in Software Engineering | Rafal Wlodarski et al. | 2604.00120v1 | null |
| 2026-03-31 | ContextClaim: A Context-Driven Paradigm for Verifiable Claim Detection | Yufeng Li et al. | 2603.30025v1 | null |
| 2026-03-31 | Physiological and Semantic Patterns in Medical Teams Using an Intelligent Tutoring System | Xiaoshan Huang et al. | 2603.29950v1 | null |
| 2026-03-31 | End-to-End Image Compression with Segmentation Guided Dual Coding for Wind Turbines | Raül Pérez-Gonzalo et al. | 2603.29927v1 | null |
| 2026-03-31 | SNEAK: Evaluating Strategic Communication and Information Leakage in Large Language Models | Adar Avsian et al. | 2603.29846v1 | null |
| 2026-03-31 | DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA | Yi Chen et al. | 2603.29844v1 | null |
| 2026-03-31 | ENEIDE: A High Quality Silver Standard Dataset for Named Entity Recognition and Linking in Historical Italian | Cristian Santini et al. | 2603.29801v1 | null |
| 2026-03-31 | Training-Free Dynamic Upcycling of Expert Language Models | Eros Fanì et al. | 2603.29765v1 | null |
| 2026-03-31 | KEditVis: A Visual Analytics System for Knowledge Editing of Large Language Models | Zhenning Chen et al. | 2603.29689v1 | null |
| 2026-03-31 | Beyond the Steeper Curve: AI-Mediated Metacognitive Decoupling and the Limits of the Dunning-Kruger Metaphor | Christopher Koch et al. | 2603.29681v1 | null |
| 2026-03-31 | A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models | Lixin Xiu et al. | 2603.29676v1 | null |
| 2026-03-31 | ASI-Evolve: AI Accelerates AI | Weixian Xu et al. | 2603.29640v1 | null |
| 2026-03-31 | FlowPIE: Test-Time Scientific Idea Evolution with Flow-Guided Literature Exploration | Qiyao Wang et al. | 2603.29557v1 | null |
| 2026-03-31 | Bringing Up a Bilingual BabyLM: Investigating Multilingual Language Acquisition Using Small-Scale Models | Linda Zeng et al. | 2603.29552v1 | null |
| 2026-03-31 | Quantization with Unified Adaptive Distillation to enable multi-LoRA based one-for-all Generative Vision Models on edge | Sowmya Vajrala et al. | 2603.29535v1 | null |
| 2026-03-31 | Baby Scale: Investigating Models Trained on Individual Children's Language Input | Steven Y. Feng et al. | 2603.29522v1 | null |
| 2026-03-31 | Impact of enriched meaning representations for language generation in dialogue tasks: A comprehensive exploration of the relevance of tasks, corpora and metrics | Alain Vázquez et al. | 2603.29518v1 | null |
| 2026-03-31 | Structural Compactness as a Complementary Criterion for Explanation Quality | Mohammad Mahdi Mesgari et al. | 2603.29491v1 | null |
| 2026-03-31 | iPoster: Content-Aware Layout Generation for Interactive Poster Design via Graph-Enhanced Diffusion Models | Xudong Zhou et al. | 2603.29469v1 | null |
| 2026-03-31 | PRISM: PRIor from corpus Statistics for topic Modeling | Tal Ishon et al. | 2603.29406v1 | null |
| 2026-03-31 | Security in LLM-as-a-Judge: A Comprehensive SoK | Aiman Almasoud et al. | 2603.29403v1 | null |
| 2026-03-31 | Beyond Idealized Patients: Evaluating LLMs under Challenging Patient Behaviors in Medical Consultations | Yahan Li et al. | 2603.29373v1 | null |
| 2026-03-31 | L-ReLF: A Framework for Lexical Dataset Creation | Anass Sedrati et al. | 2603.29346v1 | null |
| 2026-03-31 | CADEL: A Corpus of Administrative Web Documents for Japanese Entity Linking | Shohei Higashiyama et al. | 2603.29336v1 | null |
| 2026-03-31 | PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models | Amirreza Rouhi et al. | 2603.29281v1 | null |
| 2026-03-31 | Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs | Zhuowen Liang et al. | 2603.29232v1 | null |
| 2026-03-31 | Software Vulnerability Detection Using a Lightweight Graph Neural Network | Miles Farmer et al. | 2603.29216v1 | null |
| 2026-03-31 | Towards Automatic Soccer Commentary Generation with Knowledge-Enhanced Visual Reasoning | Zeyu Jin et al. | 2604.00057v1 | null |
| 2026-03-31 | Knowledge database development by large language models for countermeasures against viruses and marine toxins | Hung N. Do et al. | 2603.29149v1 | null |

Abstracts

Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation

2604.02324v1 by Daiwei Chen, Zhoutong Fu, Chengming Jiang, Haichao Zhang, Ran Zhou, Tan Wang, Chunnan Yao, Guoyao Li, Rui Cai, Yihan Cao, Ruijie Jiang, Fedor Borisyuk, Jianqiang Shen, Jingwei Wu, Ramya Korlakai Vinayak

Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
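
The collapse described in this abstract can be illustrated with a toy example. The sketch below is not the authors' GTI method: `grounded_init` is a hypothetical stand-in that averages the embeddings of a few existing tokens assumed to describe each new token, and the vocabulary, dimensions, and index lists are invented for illustration. It only shows why mean initialization leaves every new token at the same point, while any description-dependent start keeps them apart.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def mean_init(existing, num_new):
    """Baseline from the abstract: each new token starts at the vocabulary mean."""
    center = mean_vector(existing)
    return [list(center) for _ in range(num_new)]


def grounded_init(existing, description_ids):
    """Hypothetical stand-in for grounding (an assumption, not GTI itself):
    average the embeddings of the existing tokens that describe each new token."""
    return [mean_vector([existing[i] for i in ids]) for ids in description_ids]


def max_pairwise_distance(vectors):
    """Largest Euclidean distance between any two vectors."""
    best = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            best = max(best, math.dist(vectors[i], vectors[j]))
    return best


random.seed(0)
# Toy "pretrained" embedding table: 50 tokens, 8 dimensions.
vocab = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(50)]

mean_tokens = mean_init(vocab, num_new=3)
grounded_tokens = grounded_init(vocab, [[0, 1, 2], [10, 11], [20, 21, 22, 23]])

print(max_pairwise_distance(mean_tokens))      # identical vectors, so no spread
print(max_pairwise_distance(grounded_tokens))  # distinct starting points
```

With mean initialization the new tokens are indistinguishable until fine-tuning separates them; the description-based start already carries inter-token structure.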

Crystalite: A Lightweight Transformer for Efficient Crystal Modeling

2604.02270v1 by Tin Hadži Veljković, Joshua Rosenthal, Ivor Lončarić, Jan-Willem van de Meent

Generative models for crystalline materials often rely on equivariant graph neural networks, which capture geometric structure well but are costly to train and slow to sample. We present Crystalite, a lightweight diffusion Transformer for crystal modeling built around two simple inductive biases. The first is Subatomic Tokenization, a compact, chemically structured atom representation that replaces high-dimensional one-hot encodings and is better suited to continuous diffusion. The second is the Geometry Enhancement Module (GEM), which injects periodic minimum-image pair geometry directly into attention through additive geometric biases. Together, these components preserve the simplicity and efficiency of a standard Transformer while making it better matched to the structure of crystalline materials. Crystalite achieves state-of-the-art results on crystal structure prediction benchmarks and in de novo generation, attaining the best S.U.N. discovery score among the evaluated baselines while sampling substantially faster than geometry-heavy alternatives.
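
To make "periodic minimum-image pair geometry" concrete, here is a minimal sketch of the minimum-image convention. It assumes an orthorhombic cell (axis-aligned box) and fractional coordinates; the function names and the simple `-scale * distance` bias are hypothetical illustrations, not the GEM formulation from the paper.

```python
import math


def minimum_image_delta(frac_a, frac_b):
    """Fractional displacement from a to b, wrapped into [-0.5, 0.5) per axis."""
    return [((b - a + 0.5) % 1.0) - 0.5 for a, b in zip(frac_a, frac_b)]


def minimum_image_distance(frac_a, frac_b, box_lengths):
    """Distance to the nearest periodic image, assuming an orthorhombic cell."""
    delta = minimum_image_delta(frac_a, frac_b)
    return math.sqrt(sum((d * length) ** 2 for d, length in zip(delta, box_lengths)))


def distance_bias(frac_coords, box_lengths, scale=1.0):
    """Hypothetical additive attention bias: closer atom pairs get a larger value."""
    n = len(frac_coords)
    return [
        [
            -scale * minimum_image_distance(frac_coords[i], frac_coords[j], box_lengths)
            for j in range(n)
        ]
        for i in range(n)
    ]


# Two atoms near opposite faces of a 10 x 10 x 10 box are neighbours across the boundary.
atoms = [[0.05, 0.5, 0.5], [0.95, 0.5, 0.5]]
box = [10.0, 10.0, 10.0]
print(minimum_image_distance(atoms[0], atoms[1], box))  # about 1.0, not 9.0
bias = distance_bias(atoms, box)
```

A bias matrix like this would be added to the attention logits before the softmax; a general triclinic cell needs a search over neighbouring images rather than the per-axis wrap used here.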

Retrieval-Augmented Question Answering over Scientific Literature for the Electron-Ion Collider

2604.02259v1 by Tina. J. Jat, T. Ghosh, Karthik Suresh

To harness language models for answering specialized, domain-specific technical questions, Retrieval-Augmented Generation (RAG) has been widely used. In this work, we developed a Q&A application inspired by RAG. It comprises an in-house database that indexes arXiv articles related to the Electron-Ion Collider (EIC) experiment, one of the largest international scientific collaborations, and incorporates an open-source LLaMA model for answer generation. It extends a preceding application built on a proprietary model and a cloud-hosted external knowledge base for the EIC experiment. This locally deployed RAG system offers a cost-effective alternative for resource-constrained settings when building a RAG-assisted Q&A application that answers domain-specific queries in experimental nuclear physics. The setup supports data privacy by avoiding sending any pre-publication scientific data and information to the public domain. Future improvements will expand the knowledge base to encompass heterogeneous EIC-related publications and reports and upgrade the application's pipeline orchestration to the LangGraph framework.
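
The retrieve-then-generate pattern this abstract relies on can be sketched in a few lines. The example below is a generic illustration, not the authors' pipeline: it swaps in a bag-of-words cosine similarity where a real system would use an embedding index, the documents are invented, and the generation step (the LLaMA call) is left out, so only retrieval and prompt assembly are shown.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-cased token counts; a stand-in for a learned embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, bag_of_words(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, passages):
    """Assemble the context-grounded prompt that would be sent to the local model."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return f"Answer using only the context below.\n{context}\nQuestion: {query}"


# Invented toy corpus standing in for indexed article chunks.
documents = [
    "the detector measures electron scattering",
    "budget report for the cafeteria",
    "ion beam energy at the collider",
]
query = "electron scattering detector"
top = retrieve(query, documents, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

Because retrieval and generation both run locally in this design, neither the query nor the indexed text has to leave the machine, which is the privacy property the abstract highlights.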

When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning

2604.02226v1 by Juarez Monteiro, Nathan Gavenski, Gianlucca Zuin, Adriano Veloso

Reinforcement learning (RL) agents often struggle with out-of-distribution (OOD) scenarios, leading to high uncertainty and random behavior. While language models (LMs) contain valuable world knowledge, larger ones incur high computational costs, hindering real-time use, and exhibit limitations in autonomous planning. We introduce Adaptive Safety through Knowledge (ASK), which combines smaller LMs with trained RL policies to enhance OOD generalization without retraining. ASK employs Monte Carlo Dropout to assess uncertainty and queries the LM for action suggestions only when uncertainty exceeds a set threshold. This selective use preserves the efficiency of existing policies while leveraging the language model's reasoning in uncertain situations. In experiments on the FrozenLake environment, ASK shows no improvement in-domain, but demonstrates robust navigation in transfer tasks, achieving a reward of 0.95. Our findings indicate that effective neuro-symbolic integration requires careful orchestration rather than simple combination, highlighting the need for sufficient model scale and effective hybridization mechanisms for successful OOD generalization.
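
The gating rule in this abstract (estimate uncertainty with Monte Carlo Dropout, ask the LM only above a threshold) can be sketched as follows. This is a toy under stated assumptions, not the ASK implementation: the policy is a single linear layer with dropout on its inputs, uncertainty is taken to be the entropy of the averaged action distribution, and `ask_lm` is a hypothetical callback standing in for the language-model query.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mc_dropout_probs(weights, features, drop_rate, passes, rng):
    """Average action probabilities over stochastic passes with dropout left on."""
    keep_scale = 1.0 / (1.0 - drop_rate)
    totals = [0.0] * len(weights)
    for _ in range(passes):
        # Drop each input feature independently and rescale the survivors.
        kept = [0.0 if rng.random() < drop_rate else f * keep_scale for f in features]
        logits = [sum(w * x for w, x in zip(row, kept)) for row in weights]
        for index, prob in enumerate(softmax(logits)):
            totals[index] += prob
    return [t / passes for t in totals]


def entropy(probs):
    """Shannon entropy in nats; used here as the uncertainty score (an assumption)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def choose_action(weights, features, ask_lm, threshold, drop_rate=0.2, passes=30, seed=0):
    """Use the policy when it is confident; otherwise defer to the LM callback."""
    rng = random.Random(seed)
    probs = mc_dropout_probs(weights, features, drop_rate, passes, rng)
    if entropy(probs) > threshold:
        return ask_lm(features), "lm"
    return max(range(len(probs)), key=probs.__getitem__), "policy"


# One weight row per action; ask_lm is a hypothetical stub that always suggests action 1.
confident_policy = [[10.0, 10.0], [0.0, 0.0]]
unsure_policy = [[0.0, 0.0], [0.0, 0.0]]
state = [1.0, 1.0]
print(choose_action(confident_policy, state, lambda s: 1, threshold=0.5, drop_rate=0.0))
print(choose_action(unsure_policy, state, lambda s: 1, threshold=0.5))
```

The threshold trades cost against coverage: a higher value keeps more decisions on the cheap policy, a lower one routes more states to the LM.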

Universal Hypernetworks for Arbitrary Models

2604.02215v1 by Xuanfeng Zhou

Conventional hypernetworks are typically engineered around a specific base-model parameterization, so changing the target architecture often entails redesigning the hypernetwork and retraining it from scratch. We introduce the Universal Hypernetwork (UHN), a fixed-architecture generator that predicts weights from deterministic parameter, architecture, and task descriptors. This descriptor-based formulation decouples the generator architecture from target-network parameterization, so one generator can instantiate heterogeneous models across the tested architecture and task families. Our empirical claims are threefold: (1) one fixed UHN remains competitive with direct training across vision, graph, text, and formula-regression benchmarks; (2) the same UHN supports both multi-model generalization within a family and multi-task learning across heterogeneous models; and (3) UHN enables stable recursive generation with up to three intermediate generated UHNs before the final base model. Our code is available at https://github.com/Xuanfeng-Zhou/UHN.

LEO: Graph Attention Network based Hybrid Multi Sensor Extended Object Fusion and Tracking for Autonomous Driving Applications

2604.02206v1 by Mayank Mayank, Bharanidhar Duraisamy, Florian Geiss

Accurate shape and trajectory estimation of dynamic objects is essential for reliable automated driving. Classical Bayesian extended-object models offer theoretical robustness and efficiency but depend on completeness of a-priori and update-likelihood functions, while deep learning methods bring adaptability at the cost of dense annotations and high compute. We bridge these strengths with LEO (Learned Extension of Objects), a spatio-temporal Graph Attention Network that fuses multi-modal production-grade sensor tracks to learn adaptive fusion weights, ensure temporal consistency, and represent multi-scale shapes. Using a task-specific parallelogram ground-truth formulation, LEO models complex geometries (e.g. articulated trucks and trailers) and generalizes across sensor types, configurations, object classes, and regions, remaining robust for challenging and long-range targets. Evaluations on the Mercedes-Benz DRIVE PILOT SAE L3 dataset demonstrate real-time computational efficiency suitable for production systems; additional validation on public datasets such as View of Delft (VoD) further confirms cross-dataset generalization.

Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model

2604.02194v1 by Jaemin Kim, Jae O Lee, Sumyeong Ahn, Seo Yeon Park

Retrieval-Augmented Language Models (RALMs) have demonstrated significant potential in knowledge-intensive tasks; however, they remain vulnerable to performance degradation when presented with irrelevant or noisy retrieved contexts. Existing approaches to enhance robustness typically operate via coarse-grained parameter updates at the layer or module level, often overlooking the inherent neuron-level sparsity of Large Language Models (LLMs). To address this limitation, we propose Neuro-RIT (Neuron-guided Robust Instruction Tuning), a novel framework that shifts the paradigm from dense adaptation to precision-driven neuron alignment. Our method explicitly disentangles neurons that are responsible for processing relevant versus irrelevant contexts using attribution-based neuron mining. Subsequently, we introduce a two-stage instruction tuning strategy that enforces a dual capability for noise robustness: achieving direct noise suppression by functionally deactivating neurons exclusive to irrelevant contexts, while simultaneously optimizing targeted layers for evidence distillation. Extensive experiments across diverse QA benchmarks demonstrate that Neuro-RIT consistently outperforms strong baselines and robustness-enhancing methods.

TRU: Targeted Reverse Update for Efficient Multimodal Recommendation Unlearning

2604.02183v1 by Zhanting Zhou, KaHou Tam, Ziqiang Zheng, Zeyu Ma, Zhanting Zhou

Multimodal recommendation systems (MRS) jointly model user-item interaction graphs and rich item content, but this tight coupling makes user data difficult to remove once learned. Approximate machine unlearning offers an efficient alternative to full retraining, yet existing methods for MRS mainly rely on a largely uniform reverse update across the model. We show that this assumption is fundamentally mismatched to modern MRS: deleted-data influence is not uniformly distributed, but concentrated unevenly across ranking behavior, modality branches, and network layers. This non-uniformity gives rise to three bottlenecks in MRS unlearning: target-item persistence in the collaborative graph, modality imbalance across feature branches, and layer-wise sensitivity in the parameter space. To address this mismatch, we propose targeted reverse update (TRU), a plug-and-play unlearning framework for MRS. Instead of applying a blind global reversal, TRU performs three coordinated interventions across the model hierarchy: a ranking fusion gate to suppress residual target-item influence in ranking, branch-wise modality scaling to preserve retained multimodal representations, and capacity-aware layer isolation to localize reverse updates to deletion-sensitive modules. Experiments across two representative backbones, three datasets, and three unlearning regimes show that TRU consistently achieves a better retain-forget trade-off than prior approximate baselines, while security audits further confirm deeper forgetting and behavior closer to a full retraining on the retained data.

Adam's Law: Textual Frequency Law on Large Language Models

2604.02176v1 by Hongyuan Adam Lu, Z. L., Victor Wei, Zefan Zhang, Zhao Hong, Qiqi Xiang, Bowen Cao, Wai Lam

While textual frequency has been validated as relevant to human cognition in reading speed, its relevance to Large Language Models (LLMs) is seldom studied. We propose a novel research direction on textual data frequency, which, to the best of our knowledge, is an understudied topic. Our framework is composed of three units. First, this paper proposes the Textual Frequency Law (TFL), which indicates that frequent textual data should be preferred for LLMs in both prompting and fine-tuning. Since the training data of many LLMs is not public, we propose using online resources to estimate sentence-level frequency. We then use an input paraphraser to rewrite the input as a more frequent textual expression. Next, we propose Textual Frequency Distillation (TFD), which queries LLMs to perform story completion by further extending the sentences in the datasets; the resulting corpora are used to adjust the initial estimation. Finally, we propose Curriculum Textual Frequency Training (CTFT), which fine-tunes LLMs in increasing order of sentence-level frequency. Experiments are conducted on our curated Textual Frequency Paired Dataset (TFPD), covering math reasoning, machine translation, commonsense reasoning, and agentic tool calling. Results show the effectiveness of our framework.
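
The curriculum step is the simplest part of this framework to illustrate. The sketch below only shows the ordering the abstract states (fine-tuning data presented in increasing order of sentence-level frequency); the sentences and frequency scores are invented, and how the paper actually estimates frequency from online resources is not reproduced here.

```python
def curriculum_order(sentences, frequency):
    """Sort training sentences by ascending estimated sentence-level frequency."""
    return sorted(sentences, key=lambda sentence: frequency[sentence])


def curriculum_batches(sentences, frequency, batch_size):
    """Split the ordered sentences into consecutive fine-tuning batches."""
    ordered = curriculum_order(sentences, frequency)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


# Hypothetical frequency estimates (higher means the phrasing is more common).
toy_frequency = {
    "Compute the sum of two and three.": 0.9,
    "Ascertain the aggregate of two and three.": 0.1,
    "Find the total of two and three.": 0.6,
    "Add two and three.": 0.8,
}

batches = curriculum_batches(list(toy_frequency), toy_frequency, batch_size=2)
for step, batch in enumerate(batches):
    print(step, batch)
```

Each batch would be passed to the fine-tuning loop in this order, so the rarest phrasings are seen first and the most frequent ones last.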

GaelEval: Benchmarking LLM Performance for Scottish Gaelic

2604.02135v1 by Peter Devine, William Lamb, Beatrice Alex, Ignatius Ezeani, Dawn Knight, Mícheál J. Ó Meachair, Paul Rayson, Martin Wynne

Multilingual large language models (LLMs) often exhibit emergent 'shadow' capabilities in languages without official support, yet their performance on these languages remains uneven and under-measured. This is particularly acute for morphosyntactically rich minority languages such as Scottish Gaelic, where translation benchmarks fail to capture structural competence. We introduce GaelEval, the first multi-dimensional benchmark for Gaelic, comprising: (i) an expert-authored morphosyntactic MCQA task; (ii) a culturally grounded translation benchmark; and (iii) a large-scale cultural knowledge Q&A task. Evaluating 19 LLMs against a fluent-speaker human baseline (n = 30), we find that Gemini 3 Pro Preview achieves 83.3% accuracy on the linguistic task, surpassing the human baseline (78.1%). Proprietary models consistently outperform open-weight systems, and in-language (Gaelic) prompting yields a small but stable advantage (+2.4%). On the cultural task, leading models exceed 90% accuracy, though most systems perform worse under Gaelic prompting and absolute scores are inflated relative to the manual benchmark. Overall, GaelEval reveals that frontier models achieve above-human performance on several dimensions of Gaelic grammar, demonstrates the effect of Gaelic prompting, and shows a consistent performance gap favouring proprietary over open-weight models.

Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning

2604.02091v1 by Yuhang Wu, Xiangqing Shen, Fanfan Wang, Cangqi Zhou, Zhen Wu, Xinyu Dai, Rui Xia

Rerankers play a pivotal role in refining retrieval results for Retrieval-Augmented Generation. However, current reranking models are typically optimized on static human annotated relevance labels in isolation, decoupled from the downstream generation process. This isolation leads to a fundamental misalignment: documents identified as topically relevant by information retrieval metrics often fail to provide the actual utility required by the LLM for precise answer generation. To bridge this gap, we introduce ReRanking Preference Optimization (RRPO), a reinforcement learning framework that directly aligns reranking with the LLM's generation quality. By formulating reranking as a sequential decision-making process, RRPO optimizes for context utility using LLM feedback, thereby eliminating the need for expensive human annotations. To ensure training stability, we further introduce a reference-anchored deterministic baseline. Extensive experiments on knowledge-intensive benchmarks demonstrate that RRPO significantly outperforms strong baselines, including the powerful list-wise reranker RankZephyr. Further analysis highlights the versatility of our framework: it generalizes seamlessly to diverse readers (e.g., GPT-4o), integrates orthogonally with query expansion modules like Query2Doc, and remains robust even when trained with noisy supervisors.

Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection

2604.02071v1 by Soo Won Seo, KyungChae Lee, Hyungchan Cho, Taein Son, Nam Ik Cho, Jun Won Choi

Human-Object Interaction (HOI) detection aims to localize human-object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision-Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net), a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods. Code is available at https://github.com/nowuss/InCoM-Net.

Diff-KD: Diffusion-based Knowledge Distillation for Collaborative Perception under Corruptions

2604.02061v1 by Pengcheng Lyu, Chaokun Zhang, Gong Chen, Tao Tang, Zhaoxiang Luo

Multi-agent collaborative perception enables autonomous systems to overcome individual sensing limits through collective intelligence. However, real-world sensor and communication corruptions severely undermine this advantage. Crucially, existing approaches treat corruptions as static perturbations or passively conform to corrupted inputs, failing to actively recover the underlying clean semantics. To address this limitation, we introduce Diff-KD, a framework that integrates diffusion-based generative refinement into teacher-student knowledge distillation for robust collaborative perception. Diff-KD features two core components: (i) Progressive Knowledge Distillation (PKD), which treats local feature restoration as a conditional diffusion process to recover global semantics from corrupted observations; and (ii) Adaptive Gated Fusion (AGF), which dynamically weights neighbors based on ego reliability during fusion. Evaluated on OPV2V and DAIR-V2X under seven corruption types, Diff-KD achieves state-of-the-art performance in both detection accuracy and calibration robustness.

SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning

2604.01993v1 by Daeyong Kwon, Soyoung Yoon, Seung-won Hwang

Multi-hop QA benchmarks frequently reward Large Language Models (LLMs) for spurious correctness, masking ungrounded or flawed reasoning steps. To shift toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces the ungrounded Chain-of-Thought (CoT) with a strictly verifiable sequence of grounded entities. Our framework operates across two phases: (1) train-time verification, where we establish an atomic error taxonomy and a Knowledge Graph (KG)-grounded verification pipeline to eliminate noisy supervision in standard benchmarks, identifying up to 14% of instances as unanswerable, and (2) inference-time verification, where a feedback model trained on this verified dataset dynamically detects ungrounded steps in real-time. Experimental results demonstrate that SAFE not only exposes the critical flaws of existing benchmarks at train-time, but also significantly outperforms standard baselines, achieving an average accuracy gain of 8.4 pp while guaranteeing verifiable trajectories at inference-time.
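
To make the idea of a "strictly verifiable sequence of grounded entities" concrete, the sketch below checks a multi-hop chain against a small set of Knowledge Graph triples and reports the first step that is not grounded. It is a simplified illustration under stated assumptions, not the SAFE pipeline: the triple format, the function name `first_ungrounded_step`, and the toy facts are all hypothetical.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def first_ungrounded_step(steps: List[Triple], kg: Set[Triple]) -> Optional[int]:
    """Return the index of the first step that fails verification, or None.

    A step fails if its triple is absent from the KG, or if it does not
    start from the entity reached by the previous step.
    """
    previous_tail = None
    for index, (head, relation, tail) in enumerate(steps):
        if (head, relation, tail) not in kg:
            return index
        if previous_tail is not None and head != previous_tail:
            return index
        previous_tail = tail
    return None


# Toy KG and two candidate reasoning chains.
kg = {
    ("Marie Curie", "born_in", "Warsaw"),
    ("Warsaw", "capital_of", "Poland"),
}
grounded_chain = [("Marie Curie", "born_in", "Warsaw"),
                  ("Warsaw", "capital_of", "Poland")]
flawed_chain = [("Marie Curie", "born_in", "Warsaw"),
                ("Warsaw", "capital_of", "Germany")]

print(first_ungrounded_step(grounded_chain, kg))
print(first_ungrounded_step(flawed_chain, kg))
```

Returning the index of the failing step, rather than a single pass/fail flag, is what lets a feedback loop point at the exact hop to correct.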

Do We Need Bigger Models for Science? Task-Aware Retrieval with Small Language Models

2604.01965v1 by Florian Kelber, Matthias Jobst, Yuni Susanti, Michael Färber

Scientific knowledge discovery increasingly relies on large language models, yet many existing scholarly assistants depend on proprietary systems with tens or hundreds of billions of parameters. Such reliance limits reproducibility and accessibility for the research community. In this work, we ask a simple question: do we need bigger models for scientific applications? Specifically, we investigate to what extent carefully designed retrieval pipelines can compensate for reduced model scale in scientific applications. We design a lightweight retrieval-augmented framework that performs task-aware routing to select specialized retrieval strategies based on the input query. The system further integrates evidence from full-text scientific papers and structured scholarly metadata, and employs compact instruction-tuned language models to generate responses with citations. We evaluate the framework across several scholarly tasks, focusing on scholarly question answering (QA), including single- and multi-document scenarios, as well as biomedical QA under domain shift and scientific text compression. Our findings demonstrate that retrieval and model scale are complementary rather than interchangeable. While retrieval design can partially compensate for smaller models, model capacity remains important for complex reasoning tasks. This work highlights retrieval and task-aware design as key factors for building practical and reproducible scholarly assistants.

Abnormal Head Movements in Neurological Conditions: A Knowledge-Based Dataset with Application to Cervical Dystonia

2604.01962v1 by Saja Al-Dabet, Sherzod Turaev, Nazar Zaki

Abnormal head movements (AHMs) manifest across a broad spectrum of neurological disorders; however, the absence of a multi-condition resource integrating kinematic measurements, clinical severity scores, and patient demographics constitutes a persistent barrier to the development of AI-driven diagnostic tools. To address this gap, this study introduces NeuroPose-AHM, a knowledge-based dataset of neurologically induced AHMs constructed through a multi-LLM extraction framework applied to 1,430 peer-reviewed publications. The dataset contains 2,756 patient-group-level records spanning 57 neurological conditions, derived from 846 AHM-relevant papers. Inter-LLM reliability analysis confirms robust extraction performance, with study-level classification achieving strong agreement ($\kappa = 0.822$). To demonstrate the dataset's analytical utility, a four-task framework is applied to cervical dystonia (CD), the condition most directly defined by pathological head movement. First, Task 1 performs multi-label AHM type classification (F1 = 0.856). Task 2 constructs the Head-Neck Severity Index (HNSI), a unified metric that normalizes heterogeneous clinical rating scales. The clinical relevance of this index is then evaluated in Task 3, where HNSI is validated against real-world CD patient data, with aligned severe-band proportions (6.7%) providing a preliminary plausibility indication for index calibration within the high severity range. Finally, Task 4 performs bridge analysis between movement-type probabilities and HNSI scores, producing significant correlations ($p < 0.001$). These results demonstrate the analytical utility of NeuroPose-AHM as a structured, knowledge-based resource for neurological AHM research. The NeuroPose-AHM dataset is publicly available on Zenodo (https://doi.org/10.5281/zenodo.19386862).

How to measure the optimality of word or gesture order with respect to the principle of swap distance minimization

2604.01938v1 by Ramon Ferrer-i-Cancho

The structure of all the permutations of a sequence can be represented as a permutohedron, a graph where vertices are permutations and two vertices are linked if a swap of adjacent elements in the permutation of one of the vertices produces the permutation of the other vertex. It has been hypothesized that word orders in languages minimize the swap distance in the permutohedron: given a source order, word orders that are closer in the permutohedron should be less costly and thus more likely. Here we explain how to measure the degree of optimality of word order variation with respect to swap distance minimization. We illustrate the power of our novel mathematical framework by showing that crosslinguistic gestures are at least $77\%$ optimal. It is unlikely that the multiple times where crosslinguistic gestures hit optimality are due to chance. We establish the theoretical foundations for research on the optimality of word or gesture order with respect to swap distance minimization in communication systems. Finally, we introduce the quadratic assignment problem (QAP) into language research as an umbrella for multiple optimization problems and, accordingly, postulate a general principle of optimal assignment that unifies various linguistic principles including swap distance minimization.
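
The swap distance used above is the minimum number of adjacent swaps needed to turn one order into another. It equals the number of element pairs that appear in opposite relative order in the two sequences (the Kendall tau distance). A minimal Python sketch, assuming both orders contain the same distinct elements; the function name and the subject/object/verb example are illustrative and not taken from the paper:

```python
from itertools import combinations
from typing import Hashable, Sequence


def swap_distance(source: Sequence[Hashable], target: Sequence[Hashable]) -> int:
    """Minimum number of adjacent swaps that turn `source` into `target`."""
    # Rank each element by where it sits in the target order.
    position = {item: index for index, item in enumerate(target)}
    ranks = [position[item] for item in source]
    # Every pair that is out of order relative to the target costs one swap.
    return sum(1 for left, right in combinations(ranks, 2) if left > right)


print(swap_distance("SOV", "SOV"))  # identical orders
print(swap_distance("SOV", "SVO"))  # one adjacent swap apart
print(swap_distance("SOV", "VOS"))  # the full reversal
```

For a sequence of three elements the permutohedron is a hexagon, so the largest possible swap distance is 3, reached by the reversed order.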

CANDI: Curated Test-Time Adaptation for Multivariate Time-Series Anomaly Detection Under Distribution Shift

2604.01845v1 by HyunGi Kim, Jisoo Mok, Hyungyu Lee, Juhyeon Shin, Sungroh Yoon

Multivariate time-series anomaly detection (MTSAD) aims to identify deviations from normality in multivariate time-series and is critical in real-world applications. However, in real-world deployments, distribution shifts are ubiquitous and cause severe performance degradation in pre-trained anomaly detectors. Test-time adaptation (TTA) updates a pre-trained model on-the-fly using only unlabeled test data, making it promising for addressing this challenge. In this study, we propose CANDI (Curated test-time adaptation for multivariate time-series ANomaly detection under DIstribution shift), a novel TTA framework that selectively identifies and adapts to potential false positives while preserving pre-trained knowledge. CANDI introduces a False Positive Mining (FPM) strategy to curate adaptation samples based on anomaly scores and latent similarity, and incorporates a plug-and-play Spatiotemporally-Aware Normality Adaptation (SANA) module for structurally informed model updates. Extensive experiments demonstrate that CANDI significantly improves the performance of MTSAD under distribution shift, improving AUROC by up to 14% while using fewer adaptation samples.

Domain-constrained knowledge representation: A modal framework

2604.01770v1 by Chao Li, Yuru Wang, Chunyi Zhao

Knowledge graphs store large numbers of relations efficiently, but they remain weak at representing a quieter difficulty: the meaning of a concept often shifts with the domain in which it is used. A triple such as Apple, instance-of, Company may be acceptable in one setting while being misleading or unusable in another. In most current systems, domain information is attached as metadata, qualifiers, or graph-level organization. These mechanisms help with filtering and provenance, but they usually do not alter the formal status of the assertion itself. This paper argues that domain should be treated as part of knowledge representation rather than as supplementary annotation. It introduces the Domain-Contextualized Concept Graph (DCG), a framework in which domain is written into the relation and interpreted as a modal world constraint. In the DCG form (C, R at D, C'), the marker at D identifies the world in which the relation holds. Formally, the relation is interpreted through a domain-indexed necessity operator, so that truth, inference, and conflict checking are all scoped to the relevant world. This move has three consequences: ambiguous concepts can be disambiguated at the point of representation; invalid assertions can be challenged against their domain; cross-domain relations can be connected through explicit predicates. The paper develops this claim through a Kripke-style semantics, a compact predicate system, a Prolog implementation, and mappings to RDF, OWL, and relational databases. The contribution is a representational reinterpretation of domain itself. The central claim is that many practical failures in knowledge systems begin when domain is treated as external to the assertion. DCG addresses that by giving domain a structural and computable role inside the representation.
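
The DCG form (C, R at D, C') can be illustrated with a small domain-scoped fact store in which the domain is part of the assertion itself rather than attached metadata. The paper describes a Prolog implementation; the Python sketch below is a simplified stand-in, and the class names and the domain labels `business` and `botany` are assumptions made for illustration.

```python
from typing import NamedTuple, Set


class DomainTriple(NamedTuple):
    head: str      # C
    relation: str  # R
    domain: str    # D, the world in which the relation holds
    tail: str      # C'


class DomainGraph:
    """Stores assertions that are only true inside their stated domain."""

    def __init__(self) -> None:
        self._facts: Set[DomainTriple] = set()

    def assert_fact(self, head: str, relation: str, domain: str, tail: str) -> None:
        self._facts.add(DomainTriple(head, relation, domain, tail))

    def holds(self, head: str, relation: str, domain: str, tail: str) -> bool:
        # Truth is scoped to the domain: the same triple may fail elsewhere.
        return DomainTriple(head, relation, domain, tail) in self._facts

    def tails(self, head: str, relation: str, domain: str) -> Set[str]:
        return {fact.tail for fact in self._facts
                if (fact.head, fact.relation, fact.domain) == (head, relation, domain)}


graph = DomainGraph()
graph.assert_fact("Apple", "instance-of", "business", "Company")
graph.assert_fact("Apple", "instance-of", "botany", "Fruit")

print(graph.holds("Apple", "instance-of", "business", "Company"))
print(graph.holds("Apple", "instance-of", "botany", "Company"))
print(graph.tails("Apple", "instance-of", "botany"))
```

Because the domain is a field of the triple, the ambiguous concept "Apple" is disambiguated at the point of representation, and a query in the wrong domain simply fails instead of returning a misleading answer.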

FSKD: Monocular Forest Structure Inference via LiDAR-to-RGBI Knowledge Distillation

2604.01766v1 by Taimur Khan, Hannes Feilhauer, Muhammad Jazib Zafar

Very High Resolution (VHR) forest structure data at individual-tree scale is essential for carbon, biodiversity, and ecosystem monitoring. Still, airborne LiDAR remains costly and infrequent despite being the reference for forest structure metrics like Canopy Height Model (CHM), Plant Area Index (PAI), and Foliage Height Diversity (FHD). We propose FSKD: a LiDAR-to-RGB-Infrared (RGBI) knowledge distillation (KD) framework in which a multi-modal teacher fuses RGBI imagery with LiDAR-derived planar metrics and vertical profiles via cross-attention, and an RGBI-only SegFormer student learns to reproduce these outputs. Trained on 384 km$^2$ of forests in Saxony, Germany (20 cm ground sampling distance (GSD)) and evaluated on eight geographically distinct test tiles, the student achieves state-of-the-art (SOTA) zero-shot CHM performance (MedAE 4.17 m, $R^2 = 0.51$, IoU 0.87), outperforming HRCHM/DAC baselines by 29-46% in MAE (5.81 m vs. 8.14-10.84 m) with stronger correlation coefficients (0.713 vs. 0.166-0.652). Ablations show that multi-modal fusion improves performance by 10-26% over RGBI-only training, and that asymmetric distillation with appropriate model capacity is critical. The method jointly predicts CHM, PAI, and FHD, a multi-metric capability not provided by current monocular CHM estimators, although PAI/FHD transfer remains region-dependent and benefits from local calibration. The framework also remains effective under temporal mismatch (winter LiDAR, summer RGBI), removing strict co-acquisition constraints and enabling scalable 20 cm operational monitoring for workflows such as Digital Twin Germany and national Digital Orthophoto programs.
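
The reported MAE improvement range can be checked directly from the figures quoted in the abstract. The short Python sketch below assumes the percentages are relative reductions with respect to each baseline's MAE, which is the reading that reproduces the stated range; the helper name is illustrative.

```python
def relative_reduction(baseline: float, student: float) -> float:
    """Fraction by which `student` error is lower than `baseline` error."""
    return (baseline - student) / baseline


student_mae = 5.81          # metres, from the abstract
best_baseline_mae = 8.14    # metres, strongest baseline
worst_baseline_mae = 10.84  # metres, weakest baseline

low = relative_reduction(best_baseline_mae, student_mae)
high = relative_reduction(worst_baseline_mae, student_mae)
print(f"{low:.1%} to {high:.1%}")  # prints "28.6% to 46.4%"
```

Rounded to whole percentages this gives 29% and 46%, matching the range stated in the abstract.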

AeroTherm-GPT: A Verification-Centered LLM Framework for Thermal Protection System Engineering Workflows

2604.01738v1 by Chuhan Qiao, Jinglai Zheng, Jie Huang, Buyue Zhao, Fan Li, Haiming Huang

Integrating Large Language Models (LLMs) into hypersonic thermal protection system (TPS) design is bottlenecked by cascading constraint violations when generating executable simulation artifacts. General-purpose LLMs, treating generation as single-pass text completion, fail to satisfy the sequential, multi-gate constraints inherent in safety-critical engineering workflows. To address this, we propose AeroTherm-GPT, the first TPS-specialized LLM Agent, instantiated through a Constraint-Closed-Loop Generation (CCLG) framework. CCLG organizes TPS artifact generation as an iterative workflow comprising generation, validation, CDG-guided repair, execution, and audit. The Constraint Dependency Graph (CDG) encodes empirical co-resolution structure among constraint categories, directing repair toward upstream fault candidates based on lifecycle ordering priors and empirical co-resolution probabilities. This upstream-priority mechanism resolves multiple downstream violations per action, achieving a Root-Cause Fix Efficiency of 4.16 versus 1.76 for flat-checklist repair. Evaluated on HyTPS-Bench and validated against external benchmarks, AeroTherm-GPT achieves 88.7% End-to-End Success Rate (95% CI: 87.5-89.9), a gain of +12.5 pp over the matched non-CDG ablation baseline, without catastrophic forgetting on scientific reasoning and code generation tasks.
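
The upstream-priority idea can be sketched as a graph query: among the currently violated constraints, repair first those that have no violated constraint upstream of them, since fixing them may resolve several downstream violations at once. This is a deliberately simplified, deterministic illustration. The paper's CDG also uses lifecycle ordering priors and empirical co-resolution probabilities, which are omitted here, and the constraint names and function name are hypothetical.

```python
from typing import Dict, Set


def upstream_candidates(depends_on: Dict[str, Set[str]], violated: Set[str]) -> Set[str]:
    """Violated constraints with no violated constraint anywhere upstream."""

    def has_violated_ancestor(node: str, seen: Set[str]) -> bool:
        for parent in depends_on.get(node, set()):
            if parent in seen:
                continue  # already explored; also guards against cycles
            seen.add(parent)
            if parent in violated or has_violated_ancestor(parent, seen):
                return True
        return False

    return {c for c in violated if not has_violated_ancestor(c, set())}


# Hypothetical dependency structure: each key depends on the constraints in its set.
depends_on = {
    "mesh": {"geometry"},
    "boundary_conditions": {"mesh"},
    "solver_settings": {"boundary_conditions", "material"},
}

violated = {"geometry", "mesh", "solver_settings"}
print(upstream_candidates(depends_on, violated))
```

In this toy case only `geometry` is returned: the `mesh` and `solver_settings` violations both sit downstream of it, so a flat checklist would schedule three repairs where the graph suggests starting with one.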

The AnIML Ontology: Enabling Semantic Interoperability for Large-Scale Experimental Data in Interconnected Scientific Labs

2604.01728v1 by Wilf Morlidge, Elliott Watkiss-Leek, George Hannah, Harry Rostron, Andrew Ng, Ewan Johnson, Andrew Mitchell, Terry R. Payne, Valentina Tamma, Jacopo de Berardinis

Achieving semantic interoperability across heterogeneous experimental data systems remains a major barrier to data-driven scientific discovery. The Analytical Information Markup Language (AnIML), a flexible XML-based standard for analytical chemistry and biology, is increasingly used in industrial R&D labs for managing and exchanging experimental data. However, the expressivity of the XML schema permits divergent interpretations across stakeholders, introducing inconsistencies that undermine the interoperability the AnIML schema was designed to support. In this paper, we present the AnIML Ontology, an OWL 2 ontology that formalises the semantics of AnIML and aligns it with the Allotrope Data Format to support future cross-system and cross-lab interoperability. The ontology was developed using an expert-in-the-loop approach combining LLM-assisted requirement elicitation with collaborative ontology engineering. We validate the ontology through a multi-layered approach: data-driven transformation of real-world AnIML files into knowledge graphs, competency question verification via SPARQL, and a novel validation protocol based on adversarial negative competency questions mapped to established ontological anti-patterns and enforced via SHACL constraints.

LiteInception: A Lightweight and Interpretable Deep Learning Framework for General Aviation Fault Diagnosis

2604.01725v1 by Zhihuan Wei, Xinhang Chen, Danyang Han, Yang Hu, Jie Liu, Xuewen Miao, Guijiang Li

General aviation fault diagnosis and efficient maintenance are critical to flight safety; however, deploying deep learning models on resource-constrained edge devices poses dual challenges in computational capacity and interpretability. This paper proposes LiteInception, a lightweight interpretable fault diagnosis framework designed for edge deployment. The framework adopts a two-stage cascaded architecture aligned with standard maintenance workflows: Stage 1 performs high-recall fault detection, and Stage 2 conducts fine-grained fault classification on anomalous samples, thereby decoupling optimization objectives and enabling on-demand allocation of computational resources. For model compression, a multi-method fusion strategy based on mutual information, gradient analysis, and SE attention weights is proposed to reduce the input sensor channels from 23 to 15, and a 1+1 branch LiteInception architecture is introduced that compresses InceptionTime parameters by 70% and accelerates CPU inference by over 8x with less than 3% F1 loss. Furthermore, knowledge distillation is introduced as a precision-recall regulation mechanism, enabling the same lightweight model to adapt to different scenarios, such as safety-critical and auxiliary diagnosis, by switching training strategies. Finally, a dual-layer interpretability framework integrating four attribution methods is constructed, providing traceable evidence chains of "which sensor × which time period." Experiments on the NGAFID dataset demonstrate a fault detection accuracy of 81.92% with 83.24% recall, and a fault identification accuracy of 77.00%, validating the framework's favorable balance among efficiency, accuracy, and interpretability.

Human-Guided Reasoning with Large Language Models for Vietnamese Speech Emotion Recognition

2604.01711v1 by Truc Nguyen, Then Tran, Binh Truong, Phuoc Nguyen T. H

Vietnamese Speech Emotion Recognition (SER) remains challenging due to ambiguous acoustic patterns and the lack of reliable annotated data, especially in real-world conditions where emotional boundaries are not clearly separable. To address this problem, this paper proposes a human-machine collaborative framework that integrates human knowledge into the learning process rather than relying solely on data-driven models. The proposed framework is centered around LLM-based reasoning, where acoustic feature-based models are used to provide auxiliary signals such as confidence and feature-level evidence. A confidence-based routing mechanism is introduced to distinguish between easy and ambiguous samples, allowing uncertain cases to be delegated to LLMs for deeper reasoning guided by structured rules derived from human annotation behavior. In addition, an iterative refinement strategy is employed to continuously improve system performance through error analysis and rule updates. Experiments are conducted on a Vietnamese speech dataset of 2,764 samples across three emotion classes (calm, angry, panic), with high inter-annotator agreement (Fleiss Kappa = 0.8574), ensuring reliable ground truth. The proposed method achieves strong performance, reaching up to 86.59% accuracy and Macro F1 around 0.85-0.86, demonstrating its effectiveness in handling ambiguous and hard-to-classify cases. Overall, this work highlights the importance of combining data-driven models with human reasoning, providing a robust and model-agnostic approach for speech emotion recognition in low-resource settings.
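
The confidence-based routing mechanism can be sketched as a single decision: keep the acoustic model's label when it is confident, and otherwise hand the sample to an LLM-based reasoner. The Python below is a minimal illustration only. The threshold value, the function names, and the stub reasoner are assumptions; the paper's rule-guided LLM prompting is not reproduced.

```python
from typing import Callable, Tuple

# A reasoner receives the sample id, the acoustic label and its confidence,
# and returns a final emotion label.
Reasoner = Callable[[str, str, float], str]


def classify(sample_id: str, acoustic_label: str, confidence: float,
             llm_reasoner: Reasoner, threshold: float = 0.8) -> Tuple[str, str]:
    """Return (label, route), where route is 'acoustic' or 'llm'."""
    if confidence >= threshold:
        return acoustic_label, "acoustic"  # easy sample: trust the acoustic model
    # Ambiguous sample: delegate to the slower, rule-guided reasoner.
    return llm_reasoner(sample_id, acoustic_label, confidence), "llm"


def stub_reasoner(sample_id: str, acoustic_label: str, confidence: float) -> str:
    """Stand-in for an LLM call; always answers 'panic' for demonstration."""
    return "panic"


print(classify("utt-001", "calm", 0.95, stub_reasoner))
print(classify("utt-002", "angry", 0.55, stub_reasoner))
```

Keeping the reasoner behind a plain callable reflects the abstract's model-agnostic claim: any LLM client with the same signature could be substituted without touching the routing logic.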

Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework

2604.01707v1 by Yanchen Wu, Tenghui Lin, Yingli Zhou, Fangyuan Zhang, Qintian Guo, Xun Zhou, Sibo Wang, Xilin Liu, Yuchi Ma, Yixiang Fang

Memory emerges as the core module in large language model (LLM)-based agents for long-horizon complex tasks (e.g., multi-turn dialogue, game playing, scientific discovery), where memory can enable knowledge accumulation, iterative reasoning and self-evolution. A number of memory methods have been proposed in the literature. However, these methods have not been systematically and comprehensively compared under the same experimental settings. In this paper, we first summarize a unified framework that incorporates all the existing agent memory methods from a high-level perspective. We then extensively compare representative agent memory methods on two well-known benchmarks and examine the effectiveness of all methods, providing a thorough analysis of those methods. As a byproduct of our experimental analysis, we also design a new memory method by exploiting modules in the existing methods, which outperforms the state-of-the-art methods. Finally, based on these findings, we outline promising directions for future research. We believe that a deeper understanding of the behavior of existing methods can provide valuable new insights for future research.

MiCA Learns More Knowledge Than LoRA and Full Fine-Tuning

2604.01694v1 by Sten Rüdiger, Sebastian Raschka

Minor Component Adaptation (MiCA) is a novel parameter-efficient fine-tuning method for large language models that focuses on adapting underutilized subspaces of model representations. Unlike conventional methods such as Low-Rank Adaptation (LoRA), which target dominant subspaces, MiCA leverages Singular Value Decomposition to identify subspaces related to minor singular vectors associated with the least significant singular values and constrains the update of parameters during fine-tuning to those directions. This strategy leads to up to 5.9x improvement in knowledge acquisition under optimized training hyperparameters and a minimal parameter footprint of 6-60% compared to LoRA. These results suggest that constraining adaptation to minor singular directions provides a more efficient and stable mechanism for integrating new knowledge into pre-trained language models.
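
One way to read the constraint described above is written out below as a sketch, not as the authors' exact formulation. Let $W \in \mathbb{R}^{m \times n}$ be a pre-trained weight matrix and $r = \min(m, n)$. The number of retained directions $k$, the trainable matrix $A$, and the choice of left singular vectors are assumptions made for illustration.

```latex
W = U \Sigma V^{\top} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r \ge 0

U_{\min} = [\, u_{r-k+1}, \dots, u_r \,] \in \mathbb{R}^{m \times k}

W' = W + \Delta W, \qquad \Delta W = U_{\min} A,
\qquad A \in \mathbb{R}^{k \times n} \ \text{trainable}
```

Under this reading every column of $\Delta W$ lies in the span of the $k$ singular vectors with the smallest singular values, and only the $k \cdot n$ entries of $A$ are trained. A method targeting dominant subspaces would instead use $u_1, \dots, u_k$.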

PRISM: Probability Reallocation with In-Span Masking for Knowledge-Sensitive Alignment

2604.01682v1 by Chenning Xu, Mao Zheng, Mingyang Song

Supervised fine-tuning (SFT) with token-level hard labels can amplify overconfident imitation of factually unsupported targets, causing hallucinations that propagate in multi-sentence generation. We study an augmented SFT setting in which training instances include coarse sentence-level factuality risk labels and inter-sentence dependency annotations, providing structured signals about where factual commitments are weakly supported. We propose PRISM, a differentiable risk-gated framework that modifies learning only at fact-critical positions. PRISM augments standard SFT with a lightweight, model-aware probability reallocation objective that penalizes high-confidence predictions on risky target tokens, with its scope controlled by span-level risk weights and model-aware gating. Experiments on hallucination-sensitive factual benchmarks and general evaluations show that PRISM improves factual aggregates across backbones while maintaining a competitive overall capability profile. Ablations further show that the auxiliary signal is most effective when used conservatively, and that knowledge masking and model-aware reallocation play complementary roles in balancing factual correction and capability preservation.

Can Heterogeneous Language Models Be Fused?

2604.01674v1 by Shilian Chen, Jie Zhou, Qin Chen, Wen Wu, Xin Li, Qi Feng, Liang He

Model merging aims to integrate multiple expert models into a single model that inherits their complementary strengths without incurring the inference-time cost of ensembling. Recent progress has shown that merging can be highly effective when all source models are homogeneous, i.e., derived from the same pretrained backbone and therefore sharing aligned parameter coordinates or compatible task vectors. Yet this assumption is increasingly unrealistic in open model ecosystems, where useful experts are often built on different families such as Llama, Qwen, and Mistral. In such heterogeneous settings, direct weight-space fusion becomes ill-posed due to architectural mismatch, latent basis misalignment, and amplified cross-source conflict. We address this problem with HeteroFusion, a method for heterogeneous language model fusion that consists of two key components: topology-based alignment, which transfers knowledge across heterogeneous backbones by matching functional module structures instead of raw tensor coordinates, and conflict-aware denoising, which suppresses incompatible or noisy transfer signals during fusion. We further provide analytical justification showing that preserving the target adapter basis while predicting structured updates leads to a stable and well-conditioned transfer process. Across heterogeneous transfer, multi-source fusion, noisy-source robustness, and cross-family generalization settings, HeteroFusion consistently outperforms strong merging, fusion, and ensemble baselines.

PRCCF: A Persona-guided Retrieval and Causal-aware Cognitive Filtering Framework for Emotional Support Conversation

2604.01671v1 by Yanxin Luo, Xiaoyu Zhang, Jing Li, Yan Gao, Donghong Han

Emotional Support Conversation (ESC) aims to alleviate individual emotional distress by generating empathetic responses. However, existing methods face challenges in effectively supporting deep contextual understanding. To address this issue, we propose PRCCF, a Persona-guided Retrieval and Causality-aware Cognitive Filtering framework. Specifically, the framework incorporates a persona-guided retrieval mechanism that jointly models semantic compatibility and persona alignment to enhance response generation. Furthermore, it employs a causality-aware cognitive filtering module to prioritize causally relevant external knowledge, thereby improving contextual cognitive understanding for emotional reasoning. Extensive experiments on the ESConv dataset demonstrate that PRCCF outperforms state-of-the-art baselines on both automatic metrics and human evaluations. Our code is publicly available at: https://github.com/YancyLyx/PRCCF.

Hierarchical Memory Orchestration for Personalized Persistent Agents

2604.01670v1 by Junming Liu, Yifei Sun, Weihua Cheng, Haodong Lei, Yuqi Li, Yirong Chen, Ding Wang

While long-term memory is essential for intelligent agents to maintain consistent historical awareness, the accumulation of extensive interaction data often leads to performance bottlenecks. Naive storage expansion increases retrieval noise and computational latency, overwhelming the reasoning capacity of models deployed on constrained personal devices. To address this, we propose Hierarchical Memory Orchestration (HMO), a framework that organizes interaction history into a three-tiered directory driven by user-centric contextual relevance. Our system maintains a compact primary cache, coupling recent and pivotal memories with an evolving user profile to ensure agent reasoning remains aligned with individual behavioral traits. This primary cache is complemented by a high-priority secondary layer, both of which are managed within a global archive of the full interaction history. Crucially, the user persona dictates memory redistribution across this hierarchy, promoting records mapped to long-term patterns toward more active tiers while relegating less relevant information. This targeted orchestration surfaces historical knowledge precisely when needed while maintaining a lean and efficient active search space. On multiple benchmarks, HMO achieves state-of-the-art performance. Real-world deployments in ecosystems like OpenClaw demonstrate that HMO significantly enhances agent fluidity and personalization.

Robust Embodied Perception in Dynamic Environments via Disentangled Weight Fusion

2604.01669v1 by Juncen Guo, Xiaoguang Zhu, Jingyi Wu, Jingyu Zhang, Jingnan Cai, Zhenghao Niu, Liang Song

Embodied perception systems that interact continuously in open physical spaces face severe distribution drift as their environments change. However, existing domain-incremental awareness methods often rely on a domain ID obtained in advance at test time, which limits their practicality in unknown interaction scenarios. In addition, models often overfit to context-specific perceptual noise, which leads to poor generalization and catastrophic forgetting. To address these limitations, we propose a domain-ID-free and exemplar-free incremental learning framework for embodied multimedia systems that aims at robust continual environment adaptation. The method uses a disentangled representation mechanism to remove non-essential environmental style interference and to guide the model toward semantic intrinsic features shared across scenes, thereby eliminating perceptual uncertainty and improving generalization. We further use a weight fusion strategy to dynamically integrate old and new environment knowledge in parameter space, so that the model adapts to the new distribution without storing historical data while retaining as much of its discriminative ability on old environments as possible. Extensive experiments on multiple standard benchmark datasets show that the proposed method significantly reduces catastrophic forgetting in a completely exemplar-free and domain-ID-free setting and achieves higher accuracy than existing state-of-the-art methods.

M3D-BFS: a Multi-stage Dynamic Fusion Strategy for Sample-Adaptive Multi-Modal Brain Network Analysis

2604.01667v1 by Rui Dong, Xiaotong Zhang, Jiaxing Li, Yueying Li, Jiayin Wei, Youyong Kong

Multi-modal fusion is of great significance in neuroscience: it integrates information from different modalities and can outperform uni-modal methods in downstream tasks. Current multi-modal fusion methods for brain networks, which mainly focus on the structural connectivity (SC) and functional connectivity (FC) modalities, are static in nature. They feed different samples into the same model with identical computation, ignoring inherent differences between input samples. This lack of sample adaptation limits further gains in model performance. To this end, we propose a multi-stage dynamic fusion strategy (M3D-BFS) for sample-adaptive multi-modal brain network analysis. Unlike static fusion methods, we design different mixture-of-experts (MoE) modules for uni- and multi-modal representations, so that the active modules change adaptively with the input sample during inference. To alleviate the collapse of expert training that MoEs are prone to, we divide our method into three stages: we first train the uni-modal encoders separately, then pretrain the individual experts of the MoEs, and finally fine-tune the whole model. A multi-modal disentanglement loss is designed to enhance the final representations. To the best of our knowledge, this is the first work on dynamic fusion for multi-modal brain network analysis. Extensive experiments on different real-world datasets demonstrate the superiority of M3D-BFS.

CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

2604.01658v1 by Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, Jiacheng Zhu, Xuan Jiang, Sirui Li, Cathy Wu, Bryan Kian Hsiang Low, Jinhua Zhao, Paul Pu Liang

Large language model (LLM)-based evolution is a promising approach for open-ended discovery, where progress requires sustained search and knowledge accumulation. Existing methods still rely heavily on fixed heuristics and hard-coded exploration rules, which limit the autonomy of LLM agents. We present CORAL, the first framework for autonomous multi-agent evolution on open-ended problems. CORAL replaces rigid control with long-running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions. It also provides practical safeguards, including isolated workspaces, evaluator separation, resource management, and agent session and health management. Evaluated on diverse mathematical, algorithmic, and systems optimization tasks, CORAL sets new state-of-the-art results on 10 tasks, achieving 3-10 times higher improvement rates with far fewer evaluations than fixed evolutionary search baselines across tasks. On Anthropic's kernel engineering task, four co-evolving agents improve the best known score from 1363 to 1103 cycles. Mechanistic analyses further show how these gains arise from knowledge reuse and multi-agent exploration and communication. Together, these results suggest that greater agent autonomy and multi-agent evolution can substantially improve open-ended discovery. Code is available at https://github.com/Human-Agent-Society/CORAL.

AromaGen: Interactive Generation of Rich Olfactory Experiences with Multimodal Language Models

2604.01650v1 by Yunge Wen, Awu Chen, Jianing Yu, Jas Brooks, Hiroshi Ishii, Paul Pu Liang

Smell's deep connection with food, memory, and social experience has long motivated researchers to bring olfaction into interactive systems. Yet most olfactory interfaces remain limited to fixed scent cartridges and pre-defined generation patterns, and the scarcity of large-scale olfactory datasets has further constrained AI-based approaches. We present AromaGen, an AI-powered wearable interface capable of real-time, general-purpose aroma generation from free-form text or visual inputs. AromaGen is powered by a multimodal LLM that leverages latent olfactory knowledge to map semantic inputs to structured mixtures of 12 carefully selected base odorants, released through a neck-worn dispenser. Users can iteratively refine generated aromas through natural language feedback via in-context learning. Through a controlled user study ($N = 26$), AromaGen matches human-composed mixtures in zero-shot generation and significantly surpasses them after iterative refinement, achieving a median similarity of 8/10 to real food aromas and reducing perceived artificiality to levels comparable to real food. AromaGen is a step towards real-world interactive aroma generation, opening new possibilities for communication, wellbeing, and immersive technologies.

Exploring Robust Multi-Agent Workflows for Environmental Data Management

2604.01647v1 by Boyuan Guan, Jason Liu, Yanzhao Wu, Kiavash Bahreini

Embedding LLM-driven agents into environmental FAIR data management is compelling - they can externalize operational knowledge and scale curation across heterogeneous data and evolving conventions. However, replacing deterministic components with probabilistic workflows changes the failure mode: LLM pipelines may generate plausible but incorrect outputs that pass superficial checks and propagate into irreversible actions such as DOI minting and public release. We introduce EnviSmart, a production data management system deployed on campus-wide storage infrastructure for environmental research. EnviSmart treats reliability as an architectural property through two mechanisms: a three-track knowledge architecture that externalizes behaviors (governance constraints), domain knowledge (retrievable context), and skills (tool-using procedures) as persistent, interlocking artifacts; and a role-separated multi-agent design where deterministic validators and audited handoffs restore fail-stop semantics at trust boundaries before irreversible steps. We compare two production deployments. The University's GIS Center Ecological Archive (849 curated datasets) serves as a single-agent baseline. SF2Bench, a compound flooding benchmark comprising 2,452 monitoring stations and 8,557 published files spanning 39 years, validates the multi-agent workflow. The multi-agent approach improved both efficiency - completed by a single operator in two days with repeated artifact reuse across deployments - and reliability: audited handoffs detected and blocked a coordinate transformation error affecting all 2,452 stations before publication. A representative incident (ISS-004) demonstrated boundary-based containment with 10-minute detection latency, zero user exposure, and 80-minute resolution. This paper has been accepted at PEARC 2026.
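
The fail-stop idea, a deterministic validator sitting between an agent's output and an irreversible step, can be sketched in a few lines of Python. The abstract does not describe how the coordinate transformation error was actually detected, so the range check, the `Station` record, and the `publish` stand-in below are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Station:
    station_id: str
    lat: float
    lon: float

class HandoffRejected(Exception):
    """Raised when a validator blocks a handoff at a trust boundary."""

def validate_coordinates(stations):
    """Deterministic check: coordinates must be plausible degrees."""
    bad = [s.station_id for s in stations
           if not (-90.0 <= s.lat <= 90.0 and -180.0 <= s.lon <= 180.0)]
    if bad:
        raise HandoffRejected(f"{len(bad)} station(s) out of range: {bad[:3]}")

def publish(stations, audit_log):
    """Stand-in for an irreversible step; runs only after validation."""
    validate_coordinates(stations)   # fail-stop: raises before publishing
    audit_log.append(("published", len(stations)))

audit_log = []
# Values left in projected metres instead of degrees (assumed failure case).
broken = [Station("S1", 2875000.0, 580000.0)]
try:
    publish(broken, audit_log)
    blocked = False
except HandoffRejected:
    blocked = True
```

Here the bad batch never reaches the audit log, so nothing downstream of the boundary sees it.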

CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning

2604.01634v1 by Junyoung Sung, Seungwoo Lyu, Minjun Kim, Sumin An, Arsha Nagrani, Paul Hongsuck Seo

Real-world reasoning often requires combining information across modalities, connecting textual context with visual cues in a multi-hop process. Yet most multimodal benchmarks fail to capture this ability: they typically rely on single images or sets of images, where answers can be inferred from a single modality alone. This limitation is mirrored in the training data, where interleaved image-text content rarely enforces complementary, multi-hop reasoning. As a result, Vision-Language Models (VLMs) frequently hallucinate and produce reasoning traces poorly grounded in visual evidence. To address this gap, we introduce CRIT, a new dataset and benchmark built with a graph-based automatic pipeline for generating complex cross-modal reasoning tasks. CRIT spans diverse domains, including natural images, videos, and text-rich sources, and includes a manually verified test set for reliable evaluation. Experiments on this benchmark reveal that even state-of-the-art models struggle on such reasoning tasks. Models trained on CRIT show significant gains in cross-modal multi-hop reasoning, including strong improvements on SPIQA and other standard multimodal benchmarks.

GraphWalk: Enabling Reasoning in Large Language Models through Tool-Based Graph Navigation

2604.01610v1 by Taraneh Ghandi, Hamidreza Mahyar, Shachar Klaiman

The use of knowledge graphs for grounding agents in real-world Q&A applications has become increasingly common. Answering complex queries often requires multi-hop reasoning and the ability to navigate vast relational structures. Standard approaches rely on prompting techniques that steer large language models to reason over raw graph context, or retrieval-augmented generation pipelines where relevant subgraphs are injected into the context. These, however, face severe limitations with enterprise-scale KGs that cannot fit in even the largest context windows available today. We present GraphWalk, a problem-agnostic, training-free, tool-based framework that allows off-the-shelf LLMs to reason through sequential graph navigation, dramatically increasing performance across different tasks. Unlike task-specific agent frameworks that encode domain knowledge into specialized tools, GraphWalk equips the LLM with a minimal set of orthogonal graph operations sufficient to traverse any graph structure. We evaluate whether models equipped with GraphWalk can compose these operations into correct multi-step reasoning chains, where each tool call represents a verifiable step creating a transparent execution trace. We first demonstrate our approach on maze traversal, a problem non-reasoning models are completely unable to solve, then present results on graphs resembling real-world enterprise knowledge graphs. To isolate structural reasoning from world knowledge, we evaluate on entirely synthetic graphs with random, non-semantic labels. Our benchmark spans 12 query templates from basic retrieval to compound first-order logic queries. Results show that tool-based traversal yields substantial and consistent gains over in-context baselines across all model families tested, with gains becoming more pronounced as scale increases, precisely where in-context approaches fail catastrophically.
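
To make the idea of "a minimal set of orthogonal graph operations" with a transparent execution trace concrete, here is a small Python sketch. The abstract does not list the actual tools, so `list_relations` and `follow` are hypothetical names, and the two-hop query is scripted by hand where an LLM would choose each call.

```python
class GraphTools:
    """Hypothetical minimal tool set; every call is appended to a trace."""

    def __init__(self, edges):
        # edges: iterable of (source, relation, target) triples
        self._out = {}
        for src, rel, dst in edges:
            self._out.setdefault(src, []).append((rel, dst))
            self._out.setdefault(dst, [])
        self.trace = []

    def list_relations(self, node):
        """Return the outgoing relation labels of a node."""
        rels = sorted({rel for rel, _ in self._out[node]})
        self.trace.append(("list_relations", node, rels))
        return rels

    def follow(self, node, relation):
        """Return the nodes reached from `node` via `relation`."""
        targets = sorted(dst for rel, dst in self._out[node] if rel == relation)
        self.trace.append(("follow", node, relation, targets))
        return targets

# Synthetic graph with non-semantic labels, in the spirit of the evaluation setup.
tools = GraphTools([("n1", "r7", "n4"), ("n4", "r2", "n9"), ("n1", "r3", "n5")])

# Two-hop query composed from single-step tool calls.
hop1 = tools.follow("n1", "r7")
answer = [t for mid in hop1 for t in tools.follow(mid, "r2")]
```

Each entry in `tools.trace` records one call with its arguments and result, so a wrong answer can be traced back to the exact step that went astray.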

From Multi-Agent to Single-Agent: When Is Skill Distillation Beneficial?

2604.01608v1 by Binyan Xu, Dong Fang, Haitao Li, Kehuan Zhang

Multi-agent systems (MAS) tackle complex tasks by distributing expertise, though this often comes at the cost of heavy coordination overhead, context fragmentation, and brittle phase ordering. Distilling a MAS into a single-agent skill can bypass these costs, but this conversion lacks a principled answer for when and what to distill. Instead, the empirical outcome is surprisingly inconsistent: skill lift ranges from a 28% improvement to a 2% degradation across metrics of the exact same task. In this work, we reveal that skill utility is governed not by the task, but by the evaluation metric. We introduce Metric Freedom ($F$), the first a priori predictor of skill utility. $F$ measures the topological rigidity of a metric's scoring landscape by quantifying how output diversity couples with score variance via a Mantel test. Guided by $F$, we propose a two-stage adaptive distillation framework. Stage 1 acts as a selective extraction mechanism, extracting tools and knowledge while discarding restrictive structures on "free" metrics to preserve exploration. Stage 2 targets computationally intensive iterative refinement exclusively toward "rigid" metrics ($F \lesssim 0.6$) to eliminate trajectory-local overfitting. Evaluating across 4 tasks, 11 datasets, and 6 metrics, $F$ strongly predicts skill utility ($\rho = -0.62$, $p < 0.05$). Strikingly, identical agent trajectories yield diametrically opposite skill lifts under rigid versus free metrics, demonstrating that skill utility is fundamentally a metric-level property. Driven by this signal, our adaptive agent matches or exceeds the original MAS while reducing cost by up to $8\times$ and latency by up to $15\times$.
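
For readers unfamiliar with the Mantel test mentioned above, the following standard-library Python sketch shows the generic procedure: correlate the upper triangles of two distance matrices, then estimate significance by permuting the rows and columns of one of them. How the paper turns this statistic into $F$ is not stated in the abstract, and the toy "output" and "score" values are made up for illustration.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def upper(d, order):
    """Upper-triangle entries of d after relabelling items by `order`."""
    n = len(order)
    return [d[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def mantel(d1, d2, n_perm=999, seed=0):
    """Return (correlation, one-sided permutation p-value)."""
    ident = list(range(len(d1)))
    x = upper(d1, ident)
    r_obs = pearson(x, upper(d2, ident))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        perm = ident[:]
        rng.shuffle(perm)            # permute rows and columns together
        if pearson(x, upper(d2, perm)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)

# Toy data: scores are an exact linear function of the outputs, so the
# two distance matrices are perfectly coupled.
outputs = [0.0, 1.0, 2.5, 4.0, 7.0, 11.0, 16.0]
scores = [3.0 * o for o in outputs]
d_out = [[abs(a - b) for b in outputs] for a in outputs]
d_score = [[abs(a - b) for b in scores] for a in scores]
r_obs, p_value = mantel(d_out, d_score)
```

In this toy case the correlation is 1 up to rounding, which is the fully coupled extreme; weaker coupling between output distance and score distance gives a lower correlation.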

ByteRover: Agent-Native Memory Through LLM-Curated Hierarchical Context

2604.01599v1 by Andy Nguyen, Danh Doan, Hoang Pham, Bao Ha, Dat Pham, Linh Nguyen, Hieu Nguyen, Thien Nguyen, Cuong Do, Phat Nguyen, Toan Nguyen

Memory-Augmented Generation (MAG) extends large language models with external memory to support long-context reasoning, but existing approaches universally treat memory as an external service that agents call into, delegating storage to separate pipelines of chunking, embedding, and graph extraction. This architectural separation means the system that stores knowledge does not understand it, leading to semantic drift between what the agent intended to remember and what the pipeline actually captured, loss of coordination context across agents, and fragile recovery after failures. In this paper, we propose ByteRover, an agent-native memory architecture that inverts the memory pipeline: the same LLM that reasons about a task also curates, structures, and retrieves knowledge. ByteRover represents knowledge in a hierarchical Context Tree, a file-based knowledge graph organized as Domain, Topic, Subtopic, and Entry, where each entry carries explicit relations, provenance, and an Adaptive Knowledge Lifecycle (AKL) with importance scoring, maturity tiers, and recency decay. Retrieval uses a 5-tier progressive strategy that resolves most queries at sub-100 ms latency without LLM calls, escalating to agentic reasoning only for novel questions. Experiments on LoCoMo and LongMemEval demonstrate that ByteRover achieves state-of-the-art accuracy on LoCoMo and competitive results on LongMemEval while requiring zero external infrastructure, no vector database, no graph database, no embedding service, with all knowledge stored as human-readable markdown files on the local filesystem.
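
A minimal Python sketch of the Domain / Topic / Subtopic / Entry layout with importance scoring and recency decay might look as follows. The abstract gives neither the scoring formula nor the on-disk format, so the exponential half-life decay, the field names, and the nested-dictionary tree are assumptions for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    text: str
    importance: float                 # hypothetical score in [0, 1]
    last_used_day: int
    relations: list = field(default_factory=list)

def lifecycle_score(entry, today, half_life_days=30.0):
    """Importance discounted by exponential recency decay (assumed form)."""
    age = max(0, today - entry.last_used_day)
    return entry.importance * 0.5 ** (age / half_life_days)

# Domain -> Topic -> Subtopic -> entry name, as nested dictionaries.
tree = {"backend": {"auth": {"tokens": {
    "rotation": Entry("Rotate refresh tokens on every use.", 0.9, last_used_day=10),
    "legacy": Entry("Old session cookie format.", 0.9, last_used_day=-50),
}}}}

def lookup(tree, domain, topic, subtopic):
    """Resolve a path in the tree without any model call."""
    return tree[domain][topic][subtopic]

entries = lookup(tree, "backend", "auth", "tokens")
ranked = sorted(entries,
                key=lambda name: lifecycle_score(entries[name], today=10),
                reverse=True)
```

Both entries start with the same importance, but the one last used 60 days ago has decayed through two half-lives and ranks second.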

Do Large Language Models Mentalize When They Teach?

2604.01594v1 by Sevan K. Harootonian, Mark K. Ho, Thomas L. Griffiths, Yael Niv, Ilia Sucholutsky

How do LLMs decide what to teach next: by reasoning about a learner's knowledge, or by using simpler rules of thumb? We test this in a controlled task previously used to study human teaching strategies. On each trial, a teacher LLM sees a hypothetical learner's trajectory through a reward-annotated directed graph and must reveal a single edge so the learner would choose a better path if they replanned. We run a range of LLMs as simulated teachers and fit their trial-by-trial choices with the same cognitive models used for humans: a Bayes-Optimal teacher that infers which transitions the learner is missing (inverse planning), weaker Bayesian variants, heuristic baselines (e.g., reward based), and non-mentalizing utility models. In a baseline experiment matched to the stimuli presented to human subjects, most LLMs perform well, show little change in strategy over trials, and their graph-by-graph performance is similar to that of humans. Model comparison (BIC) shows that Bayes-Optimal teaching best explains most models' choices. When given a scaffolding intervention, models follow auxiliary inference- or reward-focused prompts, but these scaffolds do not reliably improve later teaching on heuristic-incongruent test graphs and can sometimes reduce performance. Overall, cognitive model fits provide insight into LLM tutoring policies and show that prompt compliance does not guarantee better teaching decisions.
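
The model comparison step relies on the Bayesian Information Criterion, $\mathrm{BIC} = k \ln n - 2 \ln \hat{L}$, where lower is better. The Python sketch below applies it to two candidate models; the log-likelihoods, parameter counts, and trial count are placeholder numbers invented for the example and are not results from the paper.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion; lower values indicate a better fit."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Placeholder fits over the same 100 trials (illustrative values only).
fits = {
    "bayes_optimal": {"log_likelihood": -120.0, "n_params": 2},
    "reward_heuristic": {"log_likelihood": -135.0, "n_params": 1},
}
scores = {name: bic(f["log_likelihood"], f["n_params"], n_obs=100)
          for name, f in fits.items()}
best = min(scores, key=scores.get)
```

With these placeholder values the extra parameter of the first model is more than paid for by its higher likelihood, so it has the lower BIC.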

A Role-Based LLM Framework for Structured Information Extraction from Healthy Food Policies

2604.01529v1 by Congjing Zhang, Ruoxuan Bao, Jingyu Li, Yoav Ackerman, Shuai Huang, Yanfang Su

Current Large Language Model (LLM) approaches for information extraction (IE) in the healthy food policy domain are often hindered by misinformation (specifically hallucinations, misclassifications, and omissions) arising from the structural diversity and inconsistency of policy documents. To address these limitations, this study proposes a role-based LLM framework that automates IE from unstructured policy data by assigning specialized roles: an LLM policy analyst for metadata and mechanism classification, an LLM legal strategy specialist for identifying complex legal approaches, and an LLM food system expert for categorizing food system stages. This framework mimics expert analysis workflows by incorporating structured domain knowledge, including explicit definitions of legal mechanisms and classification criteria, into role-specific prompts. We evaluate the framework using 608 healthy food policies from the Healthy Food Policy Project (HFPP) database, comparing its performance against zero-shot, few-shot, and chain-of-thought (CoT) baselines using Llama-3.3-70B. Our proposed framework demonstrates superior performance in complex reasoning tasks, offering a reliable and transparent methodology for automating IE from health policies.

A Self-Evolving Agentic Framework for Metasurface Inverse Design

2604.01480v1 by Yi Huang, Bowen Zheng, Yunxi Dong, Hong Tang, Huan Zhao, S. M. Rakibul Hasan Shawon, Hualiang Zhang

Metasurface inverse design has become central to realizing complex optical functionality, yet translating target responses into executable, solver-compatible workflows still demands specialized expertise in computational electromagnetics and solver-specific software engineering. Recent large language models (LLMs) offer a complementary route to reducing this workflow-construction burden, but existing language-driven systems remain largely session-bounded and do not preserve reusable workflow knowledge across inverse-design tasks. We present an agentic framework for metasurface inverse design that addresses this limitation through context-level skill evolution. The framework couples a coding agent, evolving skill artifacts, and a deterministic evaluator grounded in physical simulation so that solver-specific strategies can be iteratively refined across tasks without modifying model weights or the underlying physics solver. We evaluate the framework on a benchmark spanning multiple metasurface inverse-design task types, with separate training-aligned and held-out task families. Evolved skills raise in-distribution task success from 38% to 74%, increase criteria pass fraction from 0.510 to 0.870, and reduce average attempts from 4.10 to 2.30. On held-out task families, binary success changes only marginally, but improvements in best margin together with shifts in error composition and agent behavior indicate partial transfer of workflow knowledge. These results suggest that the main value of skill evolution lies in accumulating reusable solver-specific expertise around reliable computational engines, thereby offering a practical path toward more autonomous and accessible metasurface inverse-design workflows.

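Summary: Metasurface inverse design has become central to realizing complex optical functionality, but translating target responses into executable, solver-compatible workflows still requires specialized expertise in computational electromagnetics and solver-specific software engineering. Recent large language models (LLMs) offer a complementary route to reducing this workflow-construction burden, but existing language-driven systems remain largely session-bounded and cannot preserve reusable workflow knowledge across inverse-design tasks. We present an agentic framework for metasurface inverse design that addresses this limitation through context-level skill evolution. The framework couples a coding agent, evolving skill artifacts, and a deterministic evaluator grounded in physical simulation, so that solver-specific strategies can be refined iteratively across tasks without modifying model weights or the underlying physics solver. We evaluate the framework on a benchmark covering multiple metasurface inverse-design task types, with separate training-aligned and held-out task families. Evolved skills raise in-distribution task success from 38% to 74%, increase the criteria pass fraction from 0.510 to 0.870, and reduce average attempts from 4.10 to 2.30. On held-out task families, binary success changes only slightly, but improvements in best margin together with shifts in error composition and agent behavior indicate partial transfer of workflow knowledge. These results suggest that the main value of skill evolution lies in accumulating reusable solver-specific expertise around reliable computational engines, offering a practical path toward more autonomous and accessible metasurface inverse-design workflows.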

2604.01379v1 by Fan Huang, Munjung Kim

Can large language models (LLMs) predict which researchers will collaborate? We study this question through link prediction on real-world co-authorship networks from OpenAlex (9.96M authors, 108.7M edges), evaluating whether LLMs can predict future scientific collaborations using only author profiles, without access to graph structure. Using Qwen2.5-72B-Instruct across three historical eras of AI research, we find that LLMs and topology heuristics capture distinct signals and are strongest in complementary settings. On new-edge prediction under natural class imbalance, the LLM achieves AUROC 0.714 to 0.789, outperforming Common Neighbors, Jaccard, and Preferential Attachment, with recall up to 92.9%; under balanced evaluation, the LLM outperforms all topology heuristics in every era (AUROC 0.601 to 0.658 vs. best-heuristic 0.525 to 0.538); on continued edges, the LLM (0.687) is competitive with Adamic-Adar (0.684). Critically, 78.6 to 82.7% of new collaborations occur between authors with no common neighbor, a blind spot where all topology heuristics score zero but the LLM still achieves AUROC 0.652 by reasoning from author metadata alone. A temporal metadata ablation reveals that research concepts are the dominant signal (removing concepts drops AUROC by 0.047 to 0.084). Providing pre-computed graph features to the LLM degrades performance due to anchoring effects, confirming that LLMs and topology methods should operate as separate, complementary channels. A socio-cultural ablation finds that name-inferred ethnicity and institutional country do not predict collaboration beyond topology, reflecting the demographic homogeneity of AI research. A node2vec baseline achieves AUROC comparable to Adamic-Adar, establishing that LLMs access a fundamentally different information channel, namely author metadata, rather than encoding the same structural signal differently.

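Summary: Can large language models (LLMs) predict which researchers will collaborate? We study this question through link prediction on real-world co-authorship networks from OpenAlex (9.96 million authors, 108.7 million edges), evaluating whether LLMs can predict future scientific collaborations from author profiles alone, without access to graph structure. Using Qwen2.5-72B-Instruct across three historical eras of AI research, we find that LLMs and topology heuristics capture different signals and are strongest in complementary settings. On new-edge prediction under natural class imbalance, the LLM achieves AUROC 0.714 to 0.789, outperforming Common Neighbors, Jaccard, and Preferential Attachment, with recall up to 92.9%; under balanced evaluation, the LLM outperforms all topology heuristics in every era (AUROC 0.601 to 0.658 vs. 0.525 to 0.538 for the best heuristic); on continued edges, the LLM (0.687) is competitive with Adamic-Adar (0.684). Critically, 78.6 to 82.7% of new collaborations occur between authors with no common neighbor, a blind spot where all topology heuristics score zero but the LLM still achieves AUROC 0.652 by relying on author metadata alone. A temporal metadata ablation shows that research concepts are the dominant signal (removing concepts lowers AUROC by 0.047 to 0.084). Providing pre-computed graph features to the LLM degrades performance because of anchoring effects, confirming that LLMs and topology methods should operate as separate, complementary channels. A socio-cultural ablation finds that name-inferred ethnicity and institutional country do not predict collaboration beyond topology, reflecting the demographic homogeneity of AI research. A node2vec baseline achieves AUROC comparable to Adamic-Adar, establishing that LLMs access a fundamentally different information channel, namely author metadata, rather than encoding the same structural signal in a different way.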
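
The neighborhood-overlap heuristics named in this abstract can be illustrated with a minimal sketch. It uses the standard textbook definitions of Common Neighbors, Jaccard, and Adamic-Adar on a toy graph; the author names and the helper `build_graph` are hypothetical, and this is not the paper's evaluation code. It shows why such scores are zero for a pair of authors with no shared co-author.

```python
import math
from collections import defaultdict

def build_graph(edges):
    """Build an undirected adjacency map from (author, author) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def common_neighbors(adj, u, v):
    """Number of co-authors that u and v share."""
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    """Shared co-authors divided by the union of both neighborhoods."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0

def adamic_adar(adj, u, v):
    """Each shared neighbor w contributes 1 / log(degree(w)).

    A shared neighbor is linked to both u and v, so its degree is at least 2
    and the logarithm is strictly positive.
    """
    return sum(1.0 / math.log(len(adj[w])) for w in adj[u] & adj[v])

# Toy co-authorship graph (hypothetical authors).
adj = build_graph([("ann", "bob"), ("bob", "cat"), ("dan", "eve")])

# ann and cat share the co-author bob, so every score is positive.
print(common_neighbors(adj, "ann", "cat"), jaccard(adj, "ann", "cat"))

# ann and dan share nobody, so all three overlap scores are zero.
print(common_neighbors(adj, "ann", "dan"), jaccard(adj, "ann", "dan"),
      adamic_adar(adj, "ann", "dan"))
```

Any model that scores such a pair above zero must be drawing on information outside the shared neighborhood, which is the role the abstract assigns to author metadata.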

No Attacker Needed: Unintentional Cross-User Contamination in Shared-State LLM Agents

2604.01350v1 by Tiankai Yang, Jiate Li, Yi Nian, Shen Dong, Ruiyao Xu, Ryan Rossi, Kaize Ding, Yue Zhao

LLM-based agents increasingly operate across repeated sessions, maintaining task states to ensure continuity. In many deployments, a single agent serves multiple users within a team or organization, reusing a shared knowledge layer across user identities. This shared persistence expands the failure surface: information that is locally valid for one user can silently degrade another user's outcome when the agent reapplies it without regard for scope. We refer to this failure mode as unintentional cross-user contamination (UCC). Unlike adversarial memory poisoning, UCC requires no attacker; it arises from benign interactions whose scope-bound artifacts persist and are later misapplied. We formalize UCC through a controlled evaluation protocol, introduce a taxonomy of three contamination types, and evaluate the problem in two shared-state mechanisms. Under raw shared state, benign interactions alone produce contamination rates of 57--71%. A write-time sanitization is effective when shared state is conversational, but leaves substantial residual risk when shared state includes executable artifacts, with contamination often manifesting as silent wrong answers. These results indicate that shared-state agents need artifact-level defenses beyond text-level sanitization to prevent silent cross-user failures.

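Summary: LLM-based agents increasingly operate across repeated sessions, maintaining task state to ensure continuity. In many deployments, a single agent serves multiple users within a team or organization and reuses a shared knowledge layer across user identities. This shared persistence widens the failure surface: information that is locally valid for one user can silently degrade another user's outcome when the agent reapplies it without regard for scope. We call this failure mode unintentional cross-user contamination (UCC). Unlike adversarial memory poisoning, UCC requires no attacker; it arises from benign interactions whose scope-bound artifacts persist and are later misapplied. We formalize UCC through a controlled evaluation protocol, introduce a taxonomy of three contamination types, and evaluate the problem in two shared-state mechanisms. Under raw shared state, benign interactions alone produce contamination rates of 57 to 71%. Write-time sanitization is effective when the shared state is conversational, but it leaves substantial residual risk when the shared state includes executable artifacts, with contamination often appearing as silent wrong answers. These results indicate that shared-state agents need artifact-level defenses beyond text-level sanitization to prevent silent cross-user failures.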
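
The failure mode described in this abstract can be made concrete with a small sketch. Everything here is a hypothetical illustration: the `SharedMemory` class, the `scope` field, and the read-time filter are assumptions chosen for clarity, not the paper's evaluation protocol or its write-time sanitization defense.

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    text: str
    owner: str  # user whose session produced the entry
    scope: str  # "user": valid only for the owner; "team": valid for everyone

class SharedMemory:
    """Toy shared knowledge layer reused across user identities."""

    def __init__(self):
        self.entries = []

    def write(self, text, owner, scope="user"):
        self.entries.append(MemoryEntry(text, owner, scope))

    def read_raw(self, user):
        """Raw shared state: every entry is returned, whoever is asking."""
        return [e.text for e in self.entries]

    def read_scoped(self, user):
        """Scope-aware read: user-scoped entries go only to their owner."""
        return [e.text for e in self.entries
                if e.scope == "team" or e.owner == user]

memory = SharedMemory()
memory.write("Always report revenue in EUR", owner="alice", scope="user")
memory.write("Fiscal year starts in April", owner="alice", scope="team")

print(memory.read_raw("bob"))     # both entries, including Alice's personal convention
print(memory.read_scoped("bob"))  # only the team-wide entry
```

With the raw read, Bob's agent would inherit Alice's currency convention even though nobody acted maliciously, which is the kind of benign, scope-bound artifact the abstract describes.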

Procedural Knowledge at Scale Improves Reasoning

2604.01348v1 by Di Wu, Devendra Singh Sachan, Wen-tau Yih, Mingda Chen

Test-time scaling has emerged as an effective way to improve language models on challenging reasoning tasks. However, most existing methods treat each problem in isolation and do not systematically reuse knowledge from prior reasoning trajectories. In particular, they underutilize procedural knowledge: how to reframe a problem, choose an approach, and verify or backtrack when needed. We introduce Reasoning Memory, a retrieval-augmented generation (RAG) framework for reasoning models that explicitly retrieves and reuses procedural knowledge at scale. Starting from existing corpora of step-by-step reasoning trajectories, we decompose each trajectory into self-contained subquestion-subroutine pairs, yielding a datastore of 32 million compact procedural knowledge entries. At inference time, a lightweight in-thought prompt lets the model verbalize the core subquestion, retrieve relevant subroutines within its reasoning trace, and reason under diverse retrieved subroutines as implicit procedural priors. Across six math, science, and coding benchmarks, Reasoning Memory consistently outperforms RAG with document, trajectory, and template knowledge, as well as a compute-matched test-time scaling baseline. With a higher inference budget, it improves over no retrieval by up to 19.2% and over the strongest compute-matched baseline by 7.9% across task types. Ablation studies show that these gains come from two key factors: the broad procedural coverage of the source trajectories and our decomposition and retrieval design, which together enable effective extraction and reuse of procedural knowledge.

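Summary: Test-time scaling has become an effective way to improve language models on challenging reasoning tasks. However, most existing methods treat each problem in isolation and do not systematically reuse knowledge from prior reasoning trajectories. In particular, they underutilize procedural knowledge: how to reframe a problem, choose an approach, and verify or backtrack when needed. We introduce Reasoning Memory, a retrieval-augmented generation (RAG) framework designed for reasoning models that explicitly retrieves and reuses procedural knowledge at scale. Starting from existing corpora of step-by-step reasoning trajectories, we decompose each trajectory into self-contained subquestion-subroutine pairs, yielding a datastore of 32 million compact procedural knowledge entries. At inference time, a lightweight in-thought prompt lets the model verbalize the core subquestion, retrieve relevant subroutines within its reasoning trace, and reason under diverse retrieved subroutines as implicit procedural priors. Across six math, science, and coding benchmarks, Reasoning Memory consistently outperforms RAG with document, trajectory, and template knowledge, as well as a compute-matched test-time scaling baseline. With a higher inference budget, it improves over no retrieval by up to 19.2% and over the strongest compute-matched baseline by 7.9% across task types. Ablation studies show that these gains come from two key factors: the broad procedural coverage of the source trajectories and our decomposition and retrieval design, which together enable effective extraction and reuse of procedural knowledge.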
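
The idea of a datastore of subquestion-subroutine pairs queried by a verbalized subquestion can be sketched as follows. This is a toy stand-in under stated assumptions: the three entries are invented, and token-overlap (Jaccard) scoring replaces whatever retriever the paper actually uses, which the abstract does not specify.

```python
def tokenize(text):
    """Lowercase whitespace tokenization into a set of words."""
    return set(text.lower().split())

def retrieve(datastore, subquestion, k=2):
    """Return up to k subroutines whose stored subquestion best matches the query.

    Similarity is Jaccard overlap between token sets (an assumption for this
    sketch); entries with zero overlap are never returned.
    """
    query = tokenize(subquestion)
    scored = []
    for stored_question, subroutine in datastore:
        stored = tokenize(stored_question)
        union = query | stored
        score = len(query & stored) / len(union) if union else 0.0
        scored.append((score, subroutine))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [subroutine for score, subroutine in scored[:k] if score > 0]

# Hypothetical datastore of (subquestion, subroutine) pairs.
datastore = [
    ("how to find the roots of a quadratic equation",
     "Compute the discriminant, then apply the quadratic formula."),
    ("how to count paths in a grid",
     "Use dynamic programming over the grid cells."),
    ("how to verify a solution to an equation",
     "Substitute the candidate back into the original equation."),
]

print(retrieve(datastore, "find the roots of a quadratic", k=1))
```

In the framework described above, the retrieved subroutines would then be placed in the model's reasoning trace as procedural hints; that step is outside this sketch.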

IDEA2: Expert-in-the-loop competency question elicitation for collaborative ontology engineering

2604.01344v1 by Elliott Watkiss-Leek, Reham Alharbi, Harry Rostron, Andrew Ng, Ewan Johnson, Andrew Mitchell, Terry R. Payne, Valentina Tamma, Jacopo de Berardinis

Competency question (CQ) elicitation represents a critical but resource-intensive bottleneck in ontology engineering. This foundational phase is often hampered by the communication gap between domain experts, who possess the necessary knowledge, and ontology engineers, who formalise it. This paper introduces IDEA2, a novel, semi-automated workflow that integrates Large Language Models (LLMs) within a collaborative, expert-in-the-loop process to address this challenge. The methodology is characterised by a core iterative loop: an initial LLM-based extraction of CQs from requirement documents, a co-creational review and feedback phase by domain experts on an accessible collaborative platform, and an iterative, feedback-driven reformulation of rejected CQs by an LLM until consensus is achieved. To ensure transparency and reproducibility, the entire lifecycle of each CQ is tracked using a provenance model that captures the full lineage of edits, anonymised feedback, and generation parameters. The workflow was validated in 2 real-world scenarios (scientific data, cultural heritage), demonstrating that IDEA2 can accelerate the requirements engineering process, improve the acceptance and relevance of the resulting CQs, and exhibit high usability and effectiveness among domain experts. We release all code and experiments at https://github.com/KE-UniLiv/IDEA2

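Summary: Competency question (CQ) elicitation is a critical but resource-intensive bottleneck in ontology engineering. This foundational phase is often hampered by the communication gap between domain experts, who hold the necessary knowledge, and ontology engineers, who formalise it. This paper introduces IDEA2, a novel semi-automated workflow that integrates Large Language Models (LLMs) into a collaborative, expert-in-the-loop process to address this challenge. The method is characterised by a core iterative loop: an initial LLM-based extraction of CQs from requirement documents, a co-creational review and feedback phase carried out by domain experts on an accessible collaborative platform, and an iterative, feedback-driven reformulation of rejected CQs by an LLM until consensus is reached. To ensure transparency and reproducibility, the entire lifecycle of each CQ is tracked with a provenance model that captures the full lineage of edits, anonymised feedback, and generation parameters. The workflow was validated in two real-world scenarios (scientific data and cultural heritage), showing that IDEA2 can accelerate the requirements engineering process, improve the acceptance and relevance of the resulting CQs, and achieve high usability and effectiveness among domain experts. We release all code and experiments at https://github.com/KE-UniLiv/IDEA2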

Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models

2604.01280v1 by Marco Morini, Sara Sarto, Marcella Cornia, Lorenzo Baraldi

Answering questions about images often requires combining visual understanding with external knowledge. Multimodal Large Language Models (MLLMs) provide a natural framework for this setting, but they often struggle to identify the most relevant visual and textual evidence when answering knowledge-intensive queries. In such scenarios, models must integrate visual cues with retrieved textual evidence that is often noisy or only partially relevant, while also localizing fine-grained visual information in the image. In this work, we introduce Look Twice (LoT), a training-free inference-time framework that improves how pretrained MLLMs utilize multimodal evidence. Specifically, we exploit the model attention patterns to estimate which visual regions and retrieved textual elements are relevant to a query, and then generate the answer conditioned on this highlighted evidence. The selected cues are highlighted through lightweight prompt-level markers that encourage the model to re-attend to the relevant evidence during generation. Experiments across multiple knowledge-based VQA benchmarks show consistent improvements over zero-shot MLLMs. Additional evaluations on vision-centric and hallucination-oriented benchmarks further demonstrate that visual evidence highlighting alone improves model performance in settings without textual context, all without additional training or architectural modifications. Source code will be publicly released.

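Summary: Answering questions about images often requires combining visual understanding with external knowledge. Multimodal Large Language Models (MLLMs) provide a natural framework for this setting, but they often struggle to identify the most relevant visual and textual evidence when answering knowledge-intensive queries. In such scenarios, models must integrate visual cues with retrieved textual evidence that is often noisy or only partially relevant, while also localizing fine-grained visual information in the image. In this work, we introduce Look Twice (LoT), a training-free inference-time framework that improves how pretrained MLLMs use multimodal evidence. Specifically, we exploit the model's attention patterns to estimate which visual regions and retrieved textual elements are relevant to a query, and then generate the answer conditioned on this highlighted evidence. The selected cues are highlighted through lightweight prompt-level markers that encourage the model to re-attend to the relevant evidence during generation. Experiments on multiple knowledge-based VQA benchmarks show consistent improvements over zero-shot MLLMs. Additional evaluations on vision-centric and hallucination-oriented benchmarks further show that visual evidence highlighting alone improves model performance in settings without textual context, all without additional training or architectural modifications. Source code will be publicly released.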

Brainstacks: Cross-Domain Cognitive Capabilities via Frozen MoE-LoRA Stacks for Continual LLM Learning

2604.01152v1 by Mohammad R. Abu Ayyash

We present Brainstacks, a modular architecture for continual multi-domain fine-tuning of large language models that packages domain expertise as frozen adapter stacks composing additively on a shared frozen base at inference. Five interlocking components: (1) MoE-LoRA with Shazeer-style noisy top-2 routing across all seven transformer projections under QLoRA 4-bit quantization with rsLoRA scaling; (2) an inner loop performing residual boosting by freezing trained stacks and adding new ones; (3) an outer loop training sequential domain-specific stacks with curriculum-ordered dependencies; (4) null-space projection via randomized SVD constraining new stacks to subspaces orthogonal to prior directions, achieving zero forgetting in isolation; (5) an outcome-based sigmoid meta-router trained on empirically discovered domain-combination targets that selectively weights stacks, enabling cross-domain composition. Two boundary experiments: (6) PSN pretraining on a randomly initialized model; (7) per-domain RL (DPO/GRPO) validating compatibility with post-SFT alignment. Validated on TinyLlama-1.1B (4 domains, 9 stacks) and Gemma 3 12B IT (5 domains, 10 stacks), MoE-LoRA achieves 2.5x faster convergence than parameter-matched single LoRA, residual boosting breaks through the single-stack ceiling, and the routed system recovers generation quality destroyed by ungated stack accumulation. The central finding: the outcome-based router discovers that domain stacks encode transferable cognitive primitives (instruction-following clarity, numerical reasoning, procedural logic, chain-of-thought structure) rather than domain-specific knowledge, with medical prompts routing to chat+math stacks in 97% of cases despite zero medical data in those stacks.

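Summary: We present Brainstacks, a modular architecture for continual multi-domain fine-tuning of large language models that packages domain expertise as frozen adapter stacks composing additively on a shared frozen base at inference. Five interlocking components: (1) MoE-LoRA with Shazeer-style noisy top-2 routing across all seven transformer projections under QLoRA 4-bit quantization with rsLoRA scaling; (2) an inner loop that performs residual boosting by freezing trained stacks and adding new ones; (3) an outer loop that trains sequential domain-specific stacks with curriculum-ordered dependencies; (4) null-space projection via randomized SVD, which constrains new stacks to subspaces orthogonal to prior directions and achieves zero forgetting in isolation; (5) an outcome-based sigmoid meta-router, trained on empirically discovered domain-combination targets, that selectively weights stacks and enables cross-domain composition. Two boundary experiments: (6) PSN pretraining on a randomly initialized model; (7) per-domain reinforcement learning (DPO/GRPO) validating compatibility with post-SFT alignment. Validated on TinyLlama-1.1B (4 domains, 9 stacks) and Gemma 3 12B IT (5 domains, 10 stacks), MoE-LoRA converges 2.5x faster than a parameter-matched single LoRA, residual boosting breaks through the single-stack ceiling, and the routed system recovers the generation quality destroyed by ungated stack accumulation. The central finding: the outcome-based router discovers that domain stacks encode transferable cognitive primitives (instruction-following clarity, numerical reasoning, procedural logic, chain-of-thought structure) rather than domain-specific knowledge, with medical prompts routing to chat+math stacks in 97% of cases even though those stacks contain no medical data.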
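
Component (4), constraining a new update to the subspace orthogonal to prior directions, can be illustrated with a minimal pure-Python sketch. It is an assumption-laden stand-in: Gram-Schmidt orthogonalization on small lists replaces the randomized SVD named in the abstract, and the vectors are invented; only the projection idea itself is taken from the text.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis

def project_to_null_space(update, prior_directions):
    """Remove from `update` every component inside the span of prior_directions."""
    w = list(update)
    for b in orthonormal_basis(prior_directions):
        c = dot(w, b)
        w = [wi - c * bi for wi, bi in zip(w, b)]
    return w

# Hypothetical directions already used by earlier, frozen stacks.
prior = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
new_update = [3.0, 4.0, 5.0]

projected = project_to_null_space(new_update, prior)
print(projected)                             # only the third coordinate survives
print([dot(projected, p) for p in prior])    # zero overlap with every prior direction
```

Because the projected update has zero inner product with every prior direction, applying it cannot change the model's response along those directions, which is the intuition behind the "zero forgetting in isolation" claim.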

Looking into a Pixel by Nonlinear Unmixing -- A Generative Approach

2604.01141v1 by Maofeng Tang, Hairong Qi

Due to the large footprint of pixels in remote sensing imagery, hyperspectral unmixing (HU) has become an important and necessary procedure in hyperspectral image analysis. Traditional HU methods rely on a prior spectral mixing model, especially for nonlinear mixtures, which has largely limited the performance and generalization capacity of the unmixing approach. In this paper, we address the challenging problem of hyperspectral nonlinear unmixing (HNU) without explicit knowledge of the mixing model. Inspired by the principle of generative models, where images of the same distribution can be generated as that of the training images without knowing the exact probability distribution function of the image, we develop an invertible mixing-unmixing process via a bi-directional GAN framework, constrained by both the cycle consistency and the linkage between linear and nonlinear mixtures. The combination of cycle consistency and linear linkage provides powerful constraints without requiring an explicit mixing model. We refer to the proposed approach as the linearly-constrained CycleGAN unmixing net, or LCGU net. Experimental results indicate that the proposed LCGU net exhibits stable and competitive performance across different datasets compared with other state-of-the-art model-based HNU methods.

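Summary: Because pixels in remote sensing imagery cover a large footprint, hyperspectral unmixing (HU) has become an important and necessary procedure in hyperspectral image analysis. Traditional HU methods rely on a prior spectral mixing model, especially for nonlinear mixtures, which has largely limited the performance and generalization capacity of the unmixing approach. In this paper, we address the challenging problem of hyperspectral nonlinear unmixing (HNU) without explicit knowledge of the mixing model. Inspired by the principle of generative models, which can generate images from the same distribution as the training images without knowing the exact probability distribution function of the image, we develop an invertible mixing-unmixing process via a bi-directional GAN framework, constrained by both cycle consistency and the linkage between linear and nonlinear mixtures. The combination of cycle consistency and linear linkage provides powerful constraints without requiring an explicit mixing model. We call the proposed approach the linearly-constrained CycleGAN unmixing net, or LCGU net. Experimental results indicate that, compared with other state-of-the-art model-based HNU methods, the proposed LCGU net shows stable and competitive performance across different datasets.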

Lightweight Prompt-Guided CLIP Adaptation for Monocular Depth Estimation

2604.01118v1 by Reyhaneh Ahani Manghotay, Jie Liang

Leveraging the rich semantic features of vision-language models (VLMs) like CLIP for monocular depth estimation tasks is a promising direction, yet often requires extensive fine-tuning or lacks geometric precision. We present a parameter-efficient framework, named MoA-DepthCLIP, that adapts pretrained CLIP representations for monocular depth estimation with minimal supervision. Our method integrates a lightweight Mixture-of-Adapters (MoA) module into the pretrained Vision Transformer (ViT-B/32) backbone combined with selective fine-tuning of the final layers. This design enables spatially-aware adaptation, guided by a global semantic context vector and a hybrid prediction architecture that synergizes depth bin classification with direct regression. To enhance structural accuracy, we employ a composite loss function that enforces geometric constraints. On the NYU Depth V2 benchmark, MoA-DepthCLIP achieves competitive results, significantly outperforming the DepthCLIP baseline by improving the $δ_1$ accuracy from 0.390 to 0.745 and reducing the RMSE from 1.176 to 0.520. These results are achieved while requiring substantially fewer trainable parameters, demonstrating that lightweight, prompt-guided MoA is a highly effective strategy for transferring VLM knowledge to fine-grained monocular depth estimation tasks.

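Summary: Leveraging the rich semantic features of vision-language models (VLMs) such as CLIP for monocular depth estimation is a promising direction, but it often requires extensive fine-tuning or lacks geometric precision. We present a parameter-efficient framework, named MoA-DepthCLIP, that adapts pretrained CLIP representations for monocular depth estimation with minimal supervision. Our method integrates a lightweight Mixture-of-Adapters (MoA) module into the pretrained Vision Transformer (ViT-B/32) backbone, combined with selective fine-tuning of the final layers. This design enables spatially-aware adaptation, guided by a global semantic context vector and a hybrid prediction architecture that combines depth bin classification with direct regression. To enhance structural accuracy, we employ a composite loss function that enforces geometric constraints. On the NYU Depth V2 benchmark, MoA-DepthCLIP achieves competitive results, significantly outperforming the DepthCLIP baseline by improving the $δ_1$ accuracy from 0.390 to 0.745 and reducing the RMSE from 1.176 to 0.520. These results are achieved with substantially fewer trainable parameters, showing that lightweight, prompt-guided MoA is a highly effective strategy for transferring VLM knowledge to fine-grained monocular depth estimation tasks.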
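
The two metrics quoted in this abstract can be computed as in the sketch below. It uses the definitions that are standard in depth-estimation benchmarks (threshold 1.25 for $δ_1$), which the abstract itself does not spell out, and the depth values are invented; it is not the paper's evaluation script and assumes strictly positive depths.

```python
import math

def delta1(pred, gt, threshold=1.25):
    """Fraction of pixels where max(pred/gt, gt/pred) is below the threshold.

    Assumes all depths are strictly positive.
    """
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)

def rmse(pred, gt):
    """Root mean squared error between predicted and ground-truth depths."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))

# Hypothetical per-pixel depths in meters.
pred = [1.0, 2.0, 4.0, 3.0]
gt = [1.0, 2.2, 2.0, 3.0]

print(delta1(pred, gt))  # three of four pixels are within the 1.25 ratio
print(rmse(pred, gt))    # dominated by the single 2 m error
```

Higher $δ_1$ and lower RMSE are better, so the reported move from 0.390 to 0.745 and from 1.176 to 0.520 improves both.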

Revision or Re-Solving? Decomposing Second-Pass Gains in Multi-LLM Pipelines

2604.01029v1 by Jingjie Ning, Xueqi Li, Chengyu Yu

Multi-LLM revision pipelines, in which a second model reviews and improves a draft produced by a first, are widely assumed to derive their gains from genuine error correction. We question this assumption with a controlled decomposition experiment that uses four matched conditions to separate second-pass gains into three additive components: re-solving, scaffold, and content. We evaluate this design across two model pairs on three benchmarks spanning knowledge-intensive MCQ and competitive programming. Our results show that the gains of multi-LLM revision are not monolithic, but depend on task structure, draft quality, and the type of draft information. On MCQ tasks, where the answer space is constrained and drafts provide little structural guidance, most gains are consistent with stronger-model re-solving, and directly routing queries to the stronger model can be more effective than revising a weak draft. On code generation tasks, however, two-stage prompting remains useful because even semantically null drafts can provide substantial structural scaffolding, while weak draft content can be harmful. Finally, role-reversed experiments show that strong drafts clearly benefit weak reviewers. Ultimately, our findings demonstrate that the utility of multi-LLM revision is dynamically bottlenecked by task structure and draft quality, necessitating more targeted pipeline designs rather than blanket revision strategies.

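Summary: Multi-LLM revision pipelines, in which a second model reviews and improves a draft produced by a first, are widely assumed to derive their gains from genuine error correction. We question this assumption with a controlled decomposition experiment that uses four matched conditions to separate second-pass gains into three additive components: re-solving, scaffold, and content. We evaluate this design with two model pairs on three benchmarks covering knowledge-intensive multiple-choice questions (MCQ) and competitive programming. Our results show that the gains of multi-LLM revision are not monolithic but depend on task structure, draft quality, and the type of draft information. On MCQ tasks, where the answer space is constrained and drafts provide little structural guidance, most gains are consistent with re-solving by the stronger model, and routing queries directly to the stronger model can be more effective than revising a weak draft. On code generation tasks, however, two-stage prompting remains useful because even semantically null drafts can provide substantial structural scaffolding, while weak draft content can be harmful. Finally, role-reversed experiments show that strong drafts clearly benefit weak reviewers. Overall, our findings show that the utility of multi-LLM revision is dynamically bottlenecked by task structure and draft quality, which calls for more targeted pipeline designs rather than blanket revision strategies.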

Bridging Structured Knowledge and Data: A Unified Framework with Finance Applications

2604.00987v1 by Yi Cao, Zexun Chen, Lin William Cong, Heqing Shi

We develop Structured-Knowledge-Informed Neural Networks (SKINNs), a unified estimation framework that embeds theoretical, simulated, previously learned, or cross-domain insights as differentiable constraints within flexible neural function approximation. SKINNs jointly estimate neural network parameters and economically meaningful structural parameters in a single optimization problem, enforcing theoretical consistency not only on observed data but over a broader input domain through collocation, and therefore nesting approaches such as functional GMM, Bayesian updating, transfer learning, PINNs, and surrogate modeling. SKINNs define a class of M-estimators that are consistent and asymptotically normal with root-N convergence, sandwich covariance, and recovery of pseudo-true parameters under misspecification. We establish identification of structural parameters under joint flexibility, derive generalization and target-risk bounds under distributional shift in a convex proxy, and provide a restricted-optimal characterization of the weighting parameter that governs the bias-variance tradeoff. In an illustrative financial application to option pricing, SKINNs improve out-of-sample valuation and hedging performance, particularly at longer horizons and during high-volatility regimes, while recovering economically interpretable structural parameters with improved stability relative to conventional calibration. More broadly, SKINNs provide a general econometric framework for combining model-based reasoning with high-dimensional, data-driven estimation.

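Summary: We develop Structured-Knowledge-Informed Neural Networks (SKINNs), a unified estimation framework that embeds theoretical, simulated, previously learned, or cross-domain insights as differentiable constraints within flexible neural function approximation. SKINNs jointly estimate neural network parameters and economically meaningful structural parameters in a single optimization problem, enforcing theoretical consistency through collocation not only on observed data but over a broader input domain, and therefore nest approaches such as functional GMM, Bayesian updating, transfer learning, PINNs, and surrogate modeling. SKINNs define a class of M-estimators that are consistent and asymptotically normal with root-N convergence, have a sandwich covariance, and recover pseudo-true parameters under misspecification. We establish identification of structural parameters under joint flexibility, derive generalization and target-risk bounds under distributional shift in a convex proxy, and provide a restricted-optimal characterization of the weighting parameter that governs the bias-variance tradeoff. In an illustrative financial application to option pricing, SKINNs improve out-of-sample valuation and hedging performance, particularly at longer horizons and during high-volatility regimes, while recovering economically interpretable structural parameters with improved stability relative to conventional calibration. More broadly, SKINNs provide a general econometric framework for combining model-based reasoning with high-dimensional, data-driven estimation.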

Experience as a Compass: Multi-agent RAG with Evolving Orchestration and Agent Prompts

2604.00901v1 by Sha Li, Naren Ramakrishnan

Multi-agent Retrieval-Augmented Generation (RAG), wherein each agent takes on a specific role, supports hard queries that require multiple steps and sources, or complex reasoning. Existing approaches, however, rely on static agent behaviors and fixed orchestration strategies, leading to brittle performance on diverse, multi-hop tasks. We identify two key limitations: the lack of continuously adaptive orchestration mechanisms and the absence of behavior-level learning for individual agents. To this end, we propose HERA, a hierarchical framework that jointly evolves multi-agent orchestration and role-specific agent prompts. At the global level, HERA optimizes query-specific agent topologies through reward-guided sampling and experience accumulation. At the local level, Role-Aware Prompt Evolution refines agent behaviors via credit assignment and dual-axes adaptation along operational and behavioral principles, enabling targeted, role-conditioned improvements. On six knowledge-intensive benchmarks, HERA achieves an average improvement of 38.69% over recent baselines while maintaining robust generalization and token efficiency. Topological analyses reveal emergent self-organization, where sparse exploration yields compact, high-utility multi-agent networks, demonstrating both efficient coordination and robust reasoning.

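Summary: Multi-agent Retrieval-Augmented Generation (RAG), in which each agent takes on a specific role, supports hard queries that require multiple steps and sources or complex reasoning. Existing approaches, however, rely on static agent behaviors and fixed orchestration strategies, which leads to brittle performance on diverse, multi-hop tasks. We identify two key limitations: the lack of continuously adaptive orchestration mechanisms and the absence of behavior-level learning for individual agents. To this end, we propose HERA, a hierarchical framework that jointly evolves multi-agent orchestration and role-specific agent prompts. At the global level, HERA optimizes query-specific agent topologies through reward-guided sampling and experience accumulation. At the local level, Role-Aware Prompt Evolution refines agent behaviors via credit assignment and dual-axes adaptation along operational and behavioral principles, enabling targeted, role-conditioned improvements. On six knowledge-intensive benchmarks, HERA achieves an average improvement of 38.69% over recent baselines while maintaining robust generalization and token efficiency. Topological analyses reveal emergent self-organization, in which sparse exploration yields compact, high-utility multi-agent networks, demonstrating both efficient coordination and robust reasoning.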

Transforming OPACs into Intelligent Discovery Systems: An AI-Powered, Knowledge Graph-Driven Smart OPAC for Digital Libraries

2604.01262v1 by M. S. Rajeevan, B. Mini Devi

Traditional Online Public Access Catalogues (OPACs) are becoming less effective due to the rapid growth of scholarly literature. Conventional search methods, such as keyword indexing and Boolean queries, often fail to support efficient knowledge discovery. This paper proposes a Smart OPAC framework that transforms traditional OPACs into intelligent discovery systems using artificial intelligence and knowledge graph techniques. The framework enables semantic search, thematic filtering, and knowledge graph-based visualization to enhance user interaction and exploration. It integrates multiple open scholarly data sources and applies semantic embeddings to improve relevance and contextual understanding. The system supports exploratory search, semantic navigation, and refined result filtering based on user-defined themes. Quantitative evaluation demonstrates improvements in retrieval efficiency, relevance, and reduction of information overload. The proposed approach offers practical implications for modernizing digital library services and supports next-generation research workflows. Future work includes user-centric evaluation, personalization, and dynamic knowledge graph updates.

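Summary: Traditional Online Public Access Catalogues (OPACs) are becoming less effective because of the rapid growth of scholarly literature. Conventional search methods, such as keyword indexing and Boolean queries, often fail to support efficient knowledge discovery. This paper proposes a Smart OPAC framework that transforms traditional OPACs into intelligent discovery systems using artificial intelligence and knowledge graph techniques. The framework supports semantic search, thematic filtering, and knowledge graph-based visualization to enhance user interaction and exploration. It integrates multiple open scholarly data sources and applies semantic embeddings to improve relevance and contextual understanding. The system supports exploratory search, semantic navigation, and refined result filtering based on user-defined themes. Quantitative evaluation shows improvements in retrieval efficiency and relevance and a reduction in information overload. The proposed approach has practical implications for modernizing digital library services and supports next-generation research workflows. Future work includes user-centric evaluation, personalization, and dynamic knowledge graph updates.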

LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation

2604.00829v1 by Patrick Amadeus Irawan, Erland Hilman Fuadi, Shanu Kumar, Alham Fikri Aji, Yova Kementchedjhieva

Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability due to representation shift and cross-modal interference introduced during multimodal adaptation. Such loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that act as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by utilizing the original frozen LM as a teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which exposes the teacher to the student's multimodal representations without modifying the architecture of either model. We then selectively distill the teacher's strong linguistic signal on language-intensive data to recover language capability, while preserving the student's visual grounding on multimodal tasks. As a result, LinguDistill recovers about 10% of the performance lost on language and knowledge benchmarks, while maintaining comparable performance on vision-heavy tasks. Our findings demonstrate that linguistic capability can be recovered without additional modules, providing an efficient and practical solution to modality-specific degradation in multimodal models.

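Summary: Adapting pretrained language models (LMs) into vision-language models (VLMs) can degrade their native linguistic capability because of the representation shift and cross-modal interference introduced during multimodal adaptation. This loss is difficult to recover, even with targeted task-specific fine-tuning using standard objectives. Prior recovery approaches typically introduce additional modules that act as intermediate alignment layers to maintain or isolate modality-specific subspaces, which increases architectural complexity, adds parameters at inference time, and limits flexibility across models and settings. We propose LinguDistill, an adapter-free distillation method that restores linguistic capability by using the original frozen LM as a teacher. We overcome the key challenge of enabling vision-conditioned teacher supervision by introducing layer-wise KV-cache sharing, which exposes the teacher to the student's multimodal representations without modifying the architecture of either model. We then selectively distill the teacher's strong linguistic signal on language-intensive data to recover language capability, while preserving the student's visual grounding on multimodal tasks. As a result, LinguDistill recovers about 10% of the performance lost on language and knowledge benchmarks while maintaining comparable performance on vision-heavy tasks. Our findings show that linguistic capability can be recovered without additional modules, providing an efficient and practical solution to modality-specific degradation in multimodal models.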

From Early Encoding to Late Suppression: Interpreting LLMs on Character Counting Tasks

2604.00778v1 by Ayan Datta, Mounika Marreddy, Alexander Mehler, Zhixue Zhao, Radhika Mamidi

Large language models (LLMs) exhibit failures on elementary symbolic tasks such as character counting in a word, despite excelling on complex benchmarks. Although this limitation has been noted, the internal reasons remain unclear. We use character counting (e.g., "How many p's are in apple?") as a minimal, controlled probe that isolates token-level reasoning from higher-level confounds. Using this setting, we uncover a consistent phenomenon across modern architectures, including LLaMA, Qwen, and Gemma: models often compute the correct answer internally yet fail to express it at the output layer. Through mechanistic analysis combining probing classifiers, activation patching, logit lens analysis, and attention head tracing, we show that character-level information is encoded in early and mid-layer representations. However, this information is attenuated by a small set of components in later layers, especially the penultimate and final layer MLP. We identify these components as negative circuits: subnetworks that downweight correct signals in favor of higher-probability but incorrect outputs. Our results lead to two contributions. First, we show that symbolic reasoning failures in LLMs are not due to missing representations or insufficient scale, but arise from structured interference within the model's computation graph. This explains why such errors persist and can worsen under scaling and instruction tuning. Second, we provide evidence that LLM forward passes implement a form of competitive decoding, in which correct and incorrect hypotheses coexist and are dynamically reweighted, with final outputs determined by suppression as much as by amplification. These findings carry implications for interpretability and robustness: simple symbolic reasoning exposes weaknesses in modern LLMs, underscoring the need for design strategies that ensure information is both encoded and reliably used.

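Summary: Large language models (LLMs) fail on elementary symbolic tasks such as counting the characters in a word, despite excelling on complex benchmarks. Although this limitation has been noted, the internal reasons remain unclear. We use character counting (e.g., "How many p's are in apple?") as a minimal, controlled probe that isolates token-level reasoning from higher-level confounds. In this setting, we uncover a phenomenon that is consistent across modern architectures, including LLaMA, Qwen, and Gemma: models often compute the correct answer internally yet fail to express it at the output layer. Through mechanistic analysis combining probing classifiers, activation patching, logit lens analysis, and attention head tracing, we show that character-level information is encoded in early and mid-layer representations. However, this information is attenuated by a small set of components in later layers, especially the MLPs of the penultimate and final layers. We identify these components as negative circuits: subnetworks that downweight correct signals in favor of higher-probability but incorrect outputs. Our results lead to two contributions. First, we show that symbolic reasoning failures in LLMs are not due to missing representations or insufficient scale but arise from structured interference within the model's computation graph. This explains why such errors persist and can worsen under scaling and instruction tuning. Second, we provide evidence that LLM forward passes implement a form of competitive decoding, in which correct and incorrect hypotheses coexist and are dynamically reweighted, with final outputs determined by suppression as much as by amplification. These findings have implications for interpretability and robustness: simple symbolic reasoning exposes weaknesses in modern LLMs, underscoring the need for design strategies that ensure information is both encoded and reliably used.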

BioCOMPASS: Integrating Biomarkers into Transformer-Based Immunotherapy Response Prediction

2604.00739v1 by Sayed Hashim, Frank Soboczenski, Paul Cairns

Datasets used in immunotherapy response prediction are typically small in size, as well as diverse in cancer type, drug administered, and sequencer used. Models often drop in performance when tested on patient cohorts that are not included in the training process. Recent work has shown that transformer-based models combined with self-supervised learning generalise better than threshold-based biomarkers, but their performance is still suboptimal. We present BioCOMPASS, an extension of a transformer-based model called COMPASS, that integrates biomarkers and treatment information to further improve its generalisability. Instead of feeding biomarker data as input, we built loss components to align them with the model's intermediate representations. We found that components such as treatment gating and pathway consistency loss improved generalisability when evaluated with Leave-one-cohort-out, Leave-one-cancer-type-out and Leave-one-treatment-out strategies. Results show that building components that exploit biomarker and treatment information can help in generalisability of immunotherapy response prediction. Careful curation of additional components that leverage complementary clinical information and domain knowledge represents a promising direction for future research.


To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining

2604.00715v1 by Karan Singh, Michael Yu, Varun Gangal, Zhuofu Tao, Sachin Kumar, Emmy Liu, Steven Y. Feng

Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM data, while varying both pretraining data scale (1-150x the number of parameters) and retrieval store size (1-20x), and evaluate performance across a diverse suite of benchmarks spanning reasoning, scientific QA, and open-domain QA. We find that retrieval consistently improves performance over parametric-only baselines across model scales and introduce a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size. This scaling manifold enables us to estimate optimal allocations of a fixed data budget between pretraining and retrieval, revealing that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation. Our results provide a quantitative foundation for understanding when and how retrieval should complement pretraining, offering practical guidance for allocating data resources in the design of scalable language modeling systems.


Internal APIs Are All You Need: Shadow APIs, Shared Discovery, and the Case Against Browser-First Agent Architectures

2604.00694v1 by Lewis Tham, Nicholas Mac Gregor Garcia, Jungpil Hahn

Autonomous agents increasingly interact with the web, yet most websites remain designed for human browsers, a fundamental mismatch that the emerging "Agentic Web" must resolve. Agents must repeatedly browse pages, inspect DOMs, and reverse-engineer callable routes, a process that is slow, brittle, and redundantly repeated across agents. We observe that every modern website already exposes internal APIs (sometimes called shadow APIs) behind its user interface: first-party endpoints that power the site's own functionality. We present Unbrowse, a shared route graph that transforms browser-based route discovery into a collectively maintained index of these callable first-party interfaces. The system passively learns routes from real browsing traffic and serves cached routes via direct API calls. In a single-host live-web benchmark of equivalent information-retrieval tasks across 94 domains, fully warmed cached execution averaged 950 ms versus 3,404 ms for Playwright browser automation (3.6× mean speedup, 5.4× median), with well-cached routes completing in under 100 ms. A three-path execution model (local cache, shared graph, or browser fallback) ensures the system is voluntary and self-correcting. A three-tier micropayment model via the x402 protocol charges per-query search fees for graph lookups (Tier 3), a one-time install fee for discovery documentation (Tier 1), and optional per-execution fees for site owners who opt in (Tier 2). All tiers are grounded in a necessary condition for rational adoption: an agent uses the shared graph only when the total fee is lower than the expected cost of browser rediscovery.

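The three-path execution model and the adoption condition in the Unbrowse abstract can be illustrated with a small sketch. This is not the authors' implementation: `RouteResolver`, its fields, and the fee and cost values are hypothetical names and parameters chosen for illustration, and the shared graph is reduced to a plain dictionary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class RouteResolver:
    """Hypothetical three-path resolver; all names here are illustrative."""
    shared_graph: Dict[str, str]            # task -> route, stands in for the shared route graph
    browser_discover: Callable[[str], str]  # slow fallback that rediscovers a route in a browser
    graph_fee: float                        # assumed total fee for one shared-graph lookup
    rediscovery_cost: float                 # assumed expected cost of browser rediscovery
    local_cache: Dict[str, str] = field(default_factory=dict)

    def resolve(self, task: str) -> Tuple[str, str]:
        """Return (route, path_used) for a task."""
        # Path 1: a route already cached locally costs nothing extra.
        if task in self.local_cache:
            return self.local_cache[task], "local_cache"
        # Path 2: use the shared graph only when the fee is below the
        # expected cost of rediscovering the route in a browser.
        if task in self.shared_graph and self.graph_fee < self.rediscovery_cost:
            route, path = self.shared_graph[task], "shared_graph"
        else:
            # Path 3: browser fallback.
            route, path = self.browser_discover(task), "browser"
        self.local_cache[task] = route
        return route, path
```

In this sketch, raising `graph_fee` above `rediscovery_cost` makes every uncached task fall through to the browser path, which is one way to read the "voluntary" property the abstract describes.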

A Survey of On-Policy Distillation for Large Language Models

2604.00626v1 by Mingyang Song, Mao Zheng

Knowledge distillation has become a primary mechanism for transferring reasoning and domain expertise from frontier Large Language Models (LLMs) to smaller, deployable students. However, the dominant paradigm remains off-policy: students train on static teacher-generated data and never encounter their own errors during learning. This train-test mismatch, an instance of exposure bias, causes prediction errors to compound autoregressively at inference time. On-Policy Distillation (OPD) addresses this by letting the student generate its own trajectories and receive teacher feedback on these self-generated outputs, grounding distillation in the theory of interactive imitation learning. Despite rapid growth spanning divergence minimization, reward-guided learning, and self-play, the OPD literature remains fragmented with no unified treatment. This survey provides the first comprehensive overview of OPD for LLMs. We introduce a unified f-divergence framework over on-policy samples and organize the landscape along three orthogonal dimensions: feedback signal (logit-based, outcome-based, or self-play), teacher access (white-box, black-box, or teacher-free), and loss granularity (token-level, sequence-level, or hybrid). We systematically analyze representative methods, examine industrial deployments, and identify open problems including distillation scaling laws, uncertainty-aware feedback, and agent-level distillation.

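To make the f-divergence framing in the on-policy distillation abstract concrete, here is a minimal sketch of the standard definition D_f(p || q) = Σ q(x) f(p(x)/q(x)), averaged over the positions of a student-sampled sequence. The function names, the choice of teacher as p and student as q, and the token-level averaging are assumptions made for illustration and are not taken from the survey.

```python
import math
from typing import Callable, Sequence


def f_divergence(p: Sequence[float], q: Sequence[float],
                 f: Callable[[float], float]) -> float:
    """D_f(p || q) = sum_x q(x) * f(p(x) / q(x)); assumes p and q are strictly positive."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))


# Standard generator functions (each is convex with f(1) = 0).
def forward_kl(t: float) -> float:
    return t * math.log(t)            # yields KL(p || q)


def reverse_kl(t: float) -> float:
    return -math.log(t)               # yields KL(q || p)


def jensen_shannon(t: float) -> float:
    return 0.5 * (t * math.log(t) - (t + 1) * math.log((t + 1) / 2))


def on_policy_token_loss(teacher: Sequence[Sequence[float]],
                         student: Sequence[Sequence[float]],
                         f: Callable[[float], float]) -> float:
    """Average D_f(teacher_t || student_t) over the positions t of one
    sequence that was sampled from the student (the on-policy part)."""
    per_token = [f_divergence(p, q, f) for p, q in zip(teacher, student)]
    return sum(per_token) / len(per_token)
```

Swapping the generator changes the objective without touching the sampling loop. With the teacher as p and the student as q, `reverse_kl` gives KL(student || teacher), which is commonly described as mode-seeking, while `forward_kl` gives KL(teacher || student).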

Speech LLMs are Contextual Reasoning Transcribers

2604.00610v1 by Keqi Deng, Ruchao Fan, Bo Ren, Yiming Wang, Jinyu Li

Despite extensions to speech inputs, effectively leveraging the rich knowledge and contextual understanding of large language models (LLMs) in automatic speech recognition (ASR) remains non-trivial, as the task primarily involves direct speech-to-text mapping. To address this, this paper proposes chain-of-thought ASR (CoT-ASR), which constructs a reasoning chain that enables LLMs to first analyze the input speech and generate contextual analysis, thereby fully exploiting their generative capabilities. With this contextual reasoning, CoT-ASR then performs more informed speech recognition and completes both reasoning and transcription in a single pass. Moreover, CoT-ASR naturally supports user-guided transcription: while designed to self-generate reasoning, it can also seamlessly incorporate user-provided context to guide transcription, further extending ASR functionality. To reduce the modality gap, this paper introduces a CTC-guided Modality Adapter, which uses CTC non-blank token probabilities to weight LLM embeddings, efficiently aligning speech encoder outputs with the LLM's textual latent space. Experiments show that, compared to standard LLM-based ASR, CoT-ASR achieves a relative reduction of 8.7% in word error rate (WER) and 16.9% in entity error rate (EER).


Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents

2604.00555v1 by Thanh Luong Tuan

Enterprise adoption of Large Language Models (LLMs) is constrained by hallucination, domain drift, and the inability to enforce regulatory compliance at the reasoning level. We present a neurosymbolic architecture implemented within the Foundation AgenticOS (FAOS) platform that addresses these limitations through ontology-constrained neural reasoning. Our approach introduces a three-layer ontological framework--Role, Domain, and Interaction ontologies--that provides formal semantic grounding for LLM-based enterprise agents. We formalize the concept of asymmetric neurosymbolic coupling, wherein symbolic ontological knowledge constrains agent inputs (context assembly, tool discovery, governance thresholds) while proposing mechanisms for extending this coupling to constrain agent outputs (response validation, reasoning verification, compliance checking). We evaluate the architecture through a controlled experiment (600 runs across five industries: FinTech, Insurance, Healthcare, Vietnamese Banking, and Vietnamese Insurance), finding that ontology-coupled agents significantly outperform ungrounded agents on Metric Accuracy (p < .001, W = .460), Regulatory Compliance (p = .003, W = .318), and Role Consistency (p < .001, W = .614), with improvements greatest where LLM parametric knowledge is weakest--particularly in Vietnam-localized domains. Our contributions include: (1) a formal three-layer enterprise ontology model, (2) a taxonomy of neurosymbolic coupling patterns, (3) ontology-constrained tool discovery via SQL-pushdown scoring, (4) a proposed framework for output-side ontological validation, (5) empirical evidence for the inverse parametric knowledge effect that ontological grounding value is inversely proportional to LLM training data coverage of the domain, and (6) a production system serving 21 industry verticals with 650+ agents.


Optimsyn: Influence-Guided Rubrics Optimization for Synthetic Data Generation

2604.00536v1 by Zhiting Fan, Ruizhe Chen, Tianxiang Hu, Ru Peng, Zenan Huang, Haokai Xu, Yixin Chen, Jian Wu, Junbo Zhao, Zuozhu Liu

Large language models (LLMs) achieve strong downstream performance largely due to abundant supervised fine-tuning (SFT) data. However, high-quality SFT data in knowledge-intensive domains such as humanities, social sciences, medicine, law, and finance is scarce because expert curation is expensive, privacy constraints are strict, and label consistency is hard to ensure. Recent work uses synthetic data, typically by prompting a generator over domain documents and filtering outputs with handcrafted rubrics. Yet rubric design is expert-dependent, transfers poorly across domains, and is often optimized through a brittle heuristic loop of writing rubrics, synthesizing data, training, inspecting results, and manually guessing revisions. This process lacks reliable quantitative feedback about how a rubric affects downstream performance. We propose evaluating synthetic data by its training utility on the target model and using this signal to guide data generation. Inspired by influence estimation, we adopt an optimizer-aware estimator that uses gradient information to quantify each synthetic sample's contribution to a target model's objective on specific tasks. Our analysis shows that even when synthetic and real samples are close in embedding space, their influence on learning can differ substantially. Based on this insight, we propose an optimization-based framework that adapts rubrics using target-model feedback. We provide lightweight guiding text and use a rubric-specialized model to generate task-conditioned rubrics. Influence score is used as the reward to optimize the rubric generator with reinforcement learning. Experiments across domains, target models, and data generators show consistent improvements and strong generalization without task-specific tuning.


Polysemanticity or Polysemy? Lexical Identity Confounds Superposition Metrics

2604.00443v1 by Iyad Ait Hou, Rebecca Hwa

If the same neuron activates for both "lender" and "riverside," standard metrics attribute the overlap to superposition: the neuron must be compressing two unrelated concepts. This work explores how much of the overlap is due to a lexical confound: neurons fire for a shared word form (such as "bank") rather than for two compressed concepts. A 2x2 factorial decomposition reveals that the lexical-only condition (same word, different meaning) consistently exceeds the semantic-only condition (different word, same meaning) across models spanning 110M-70B parameters. The confound carries into sparse autoencoders (18-36% of features blend senses), sits in <=1% of activation dimensions, and hurts downstream tasks: filtering it out improves word sense disambiguation and makes knowledge edits more selective (p = 0.002).


TR-ICRL: Test-Time Rethinking for In-Context Reinforcement Learning

2604.00438v1 by Wenxuan Jiang, Yuxin Zuo, Zijian Zhang, Xuecheng Wu, Zining Fan, Wenxuan Liu, Li Chen, Xiaoyu Li, Xuezhi Cao, Xiaolong Jin, Ninghao Liu

In-Context Reinforcement Learning (ICRL) enables Large Language Models (LLMs) to learn online from external rewards directly within the context window. However, a central challenge in ICRL is reward estimation, as models typically lack access to ground truth during inference. To address this limitation, we propose Test-Time Rethinking for In-Context Reinforcement Learning (TR-ICRL), a novel ICRL framework designed for both reasoning and knowledge-intensive tasks. TR-ICRL operates by first retrieving the most relevant instances from an unlabeled evaluation set for a given query. During each ICRL iteration, the LLM generates a set of candidate answers for every retrieved instance. Next, a pseudo-label is derived from this set through majority voting. This label then serves as a proxy to give reward messages and generate formative feedback, guiding the LLM through iterative refinement. In the end, this synthesized contextual information is integrated with the original query to form a comprehensive prompt, with the answer determined through a final round of majority voting. TR-ICRL is evaluated on mainstream reasoning and knowledge-intensive tasks, where it demonstrates significant performance gains. Remarkably, TR-ICRL improves Qwen2.5-7B by 21.23% on average on MedQA and even 137.59% on AIME2024. Extensive ablation studies and analyses further validate the effectiveness and robustness of our approach. Our code is available at https://github.com/pangpang-xuan/TR_ICRL.

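The pseudo-labelling step in the TR-ICRL abstract (majority voting over candidate answers, then using the winner as a proxy for the missing ground truth) can be sketched as follows. The function names and the binary reward format are assumptions for illustration; the authors' repository may structure this differently.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def pseudo_label(candidates: Sequence[str]) -> Tuple[str, float]:
    """Return the majority answer and its vote share.
    Ties go to the answer that appeared first in `candidates`."""
    counts = Counter(candidates)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(candidates)


def reward_messages(candidates: Sequence[str], label: str) -> List[int]:
    """Assumed reward format: 1 if a candidate agrees with the pseudo-label, else 0."""
    return [1 if c == label else 0 for c in candidates]


# Usage: five sampled answers for one retrieved instance.
answers = ["B", "A", "B", "C", "B"]
label, share = pseudo_label(answers)
rewards = reward_messages(answers, label)
```

The vote share is not mentioned in the abstract. It is included only because a low share is an obvious signal that the pseudo-label may be unreliable.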

COTTA: Context-Aware Transfer Adaptation for Trajectory Prediction in Autonomous Driving

2604.00402v1 by Seohyoung Park, Jaeyeol Lim, Seoyoung Ju, Kyeonghun Kim, Nam-Joon Kim, Hyuk-Jae Lee

Developing robust models to accurately predict the trajectories of surrounding agents is fundamental to autonomous driving safety. However, most public datasets, such as the Waymo Open Motion Dataset and Argoverse, are collected in Western road environments and do not reflect the unique traffic patterns, infrastructure, and driving behaviors of other regions, including South Korea. This domain discrepancy leads to performance degradation when state-of-the-art models trained on Western data are deployed in different geographic contexts. In this work, we investigate the adaptability of Query-Centric Trajectory Prediction (QCNet) when transferred from U.S.-based data to Korean road environments. Using a Korean autonomous driving dataset, we compare four training strategies: zero-shot transfer, training from scratch, full fine-tuning, and encoder freezing. Experimental results demonstrate that leveraging pretrained knowledge significantly improves prediction performance. Specifically, selectively fine-tuning the decoder while freezing the encoder yields the best trade-off between accuracy and training efficiency, reducing prediction error by over 66% compared to training from scratch. This study provides practical insights into effective transfer learning strategies for deploying trajectory prediction models in new geographic domains.


RAGShield: Provenance-Verified Defense-in-Depth Against Knowledge Base Poisoning in Government Retrieval-Augmented Generation Systems

2604.00387v1 by KrishnaSaiReddy Patil

RAG systems deployed across federal agencies for citizen-facing services are vulnerable to knowledge base poisoning attacks, where adversaries inject malicious documents to manipulate outputs. Recent work demonstrates that as few as 10 adversarial passages can achieve 98.2% retrieval success rates. We observe that RAG knowledge base poisoning is structurally analogous to software supply chain attacks, and propose RAGShield, a five-layer defense-in-depth framework applying supply chain provenance verification to the RAG knowledge pipeline. RAGShield introduces: (1) C2PA-inspired cryptographic document attestation blocking unsigned and forged documents at ingestion; (2) trust-weighted retrieval prioritizing provenance-verified sources; (3) a formal taint lattice with cross-source contradiction detection catching insider threats even when provenance is valid; (4) provenance-aware generation with auditable citations; and (5) NIST SP 800-53 compliance mapping across 15 control families. Evaluation on a 500-passage Natural Questions corpus with 63 attack documents and 200 queries against five adversary tiers achieves 0.0% attack success rate including adaptive attacks (95% CI: [0.0%, 1.9%]) with 0.0% false positive rate. We honestly report that insider in-place replacement attacks achieve 17.5% ASR, identifying the fundamental limit of ingestion-time defense. The cross-source contradiction detector catches subtle numerical manipulation attacks that bypass provenance verification entirely.

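The reported interval for a 0.0% attack success rate (95% CI: [0.0%, 1.9%]) is what a Wilson score interval gives for zero successes in 200 trials. The sketch below checks that arithmetic. Both the choice of the Wilson method and the use of the 200 queries as the trial count are assumptions; the abstract does not say how the interval was computed.

```python
import math
from statistics import NormalDist
from typing import Tuple


def wilson_interval(successes: int, trials: int,
                    confidence: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p_hat = successes / trials
    denom = 1 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials ** 2)
    )
    return max(0.0, centre - half), min(1.0, centre + half)


# Assumed setting: 0 successful attacks out of 200 queries.
low, high = wilson_interval(successes=0, trials=200)
print(f"{low:.1%} to {high:.1%}")  # prints "0.0% to 1.9%"
```

With zero observed successes the upper bound reduces to z² / (n + z²), so it shrinks only as the number of trials grows. A 0.0% point estimate from a few hundred queries therefore still leaves an upper bound of nearly 2%.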

Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning

2604.00344v1 by Eric Hanchen Jiang, Levina Li, Rui Sun, Xiao Liang, Yubei Li, Yuchen Wu, Haozheng Luo, Hengli Li, Zhi Zhang, Zhaolu Kang, Kai-Wei Chang, Ying Nian Wu

Large Language Models (LLMs) have shown remarkable performance in completing various tasks. However, solving complex problems often requires the coordination of multiple agents, raising a fundamental question: how to effectively select and interconnect these agents. In this paper, we propose Agent Q-Mix, a reinforcement learning framework that reformulates topology selection as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. Our method learns decentralized communication decisions using QMIX value factorization, where each agent selects from a set of communication actions that jointly induce a round-wise communication graph. At its core, Agent Q-Mix combines a topology-aware GNN encoder, GRU memory, and per-agent Q-heads under a Centralized Training with Decentralized Execution (CTDE) paradigm. The framework optimizes a reward function that balances task accuracy with token cost. Across seven core benchmarks in coding, reasoning, and mathematics, Agent Q-Mix achieves the highest average accuracy compared to existing methods while demonstrating superior token efficiency and robustness against agent failure. Notably, on the challenging Humanity's Last Exam (HLE) using Gemini-3.1-Flash-Lite as a backbone, Agent Q-Mix achieves 20.8% accuracy, outperforming Microsoft Agent Framework (19.2%) and LangGraph (19.2%), followed by AutoGen and Lobster by OpenClaw. These results underscore the effectiveness of learned, decentralized topology optimization in pushing the boundaries of multi-agent reasoning.

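The idea behind QMIX-style value factorization, which the Agent Q-Mix abstract builds on, is that the joint value is a monotonic function of the per-agent values, so each agent can act greedily on its own Q-head at execution time. The sketch below is a deliberately simplified illustration: it uses a single linear mixing layer where QMIX uses a small mixing network, and `greedy_actions`, `monotonic_mix`, and the hypernetwork parameters are hypothetical names, not the paper's code.

```python
from typing import List, Sequence


def greedy_actions(q_heads: Sequence[Sequence[float]]) -> List[int]:
    """Decentralized execution: each agent picks the argmax of its own Q-head."""
    return [max(range(len(q)), key=q.__getitem__) for q in q_heads]


def monotonic_mix(chosen_q: Sequence[float], state: Sequence[float],
                  hyper_w: Sequence[Sequence[float]],
                  hyper_b: Sequence[float]) -> float:
    """Simplified mixer: Q_tot = sum_i |w_i(state)| * Q_i + b(state).
    The absolute value keeps every mixing weight non-negative, so Q_tot
    never decreases when any single agent's Q-value increases."""
    weights = [abs(sum(w * s for w, s in zip(row, state))) for row in hyper_w]
    bias = sum(b * s for b, s in zip(hyper_b, state))
    return sum(w * q for w, q in zip(weights, chosen_q)) + bias


# Usage: two agents, three communication actions each (values are made up).
q_heads = [[0.1, 0.9, 0.3], [0.5, 0.2, 0.4]]
actions = greedy_actions(q_heads)
chosen = [q[a] for q, a in zip(q_heads, actions)]
q_tot = monotonic_mix(chosen, state=[1.0, -2.0],
                      hyper_w=[[0.5, 1.0], [-1.0, 0.25]], hyper_b=[0.1, 0.2])
```

Because the weights are non-negative, the joint action formed from the per-agent argmaxes also maximizes `q_tot`, which is what allows centralized training to coexist with decentralized execution.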

Improvisational Games as a Benchmark for Social Intelligence of AI Agents: The Case of Connections

2604.00284v1 by Gaurav Rajesh Parikh, Angikar Ghosal

We formally introduce an improvisational wordplay game called Connections to explore the reasoning capabilities of AI agents. Playing Connections combines skills in knowledge retrieval, summarization, and awareness of the cognitive states of other agents. We show how the game serves as a good benchmark for the social intelligence abilities of language-model-based agents, abilities that go beyond the agents' own memory and deductive reasoning and also involve gauging the understanding capabilities of other agents. Finally, we show how, through communication with other agents in a constrained environment, AI agents must demonstrate social awareness and intelligence in games involving collaboration.


A Study on the Impact of Fault localization Granularity for Repository-Scale Code Repair Tasks

2604.00167v1 by Joseph Townsend, Chandresh Pravin, Kwun Ho Ngan, Matthieu Parizy

Automatic program repair can be a challenging task, especially when resolving complex issues at the repository level, which often involves issue reproduction, fault localization, code repair, testing and validation. Issues of this scale are commonly found in popular GitHub repositories or datasets derived from them. Some repository-level approaches separate localization and repair into distinct phases. Where this is the case, the fault localization approaches vary in the granularity of localization. Although the impact of granularity has been explored to some degree for smaller datasets, not all studies isolate this issue from the separate question of localization accuracy by testing code repair under the assumption of perfect fault localization. To the best of the authors' knowledge, no repository-scale studies have explicitly investigated granularity under this assumption, nor conducted a systematic empirical comparison of granularity levels in isolation. We propose a framework for performing such tests by modifying the localization phase of the Agentless framework to retrieve ground-truth localization data and include this as context in the prompt fed to the repair phase. We show that under this configuration, and as a generalization over the SWE-Bench-Mini dataset, function-level granularity yields a higher repair rate than line-level and file-level granularity. However, a deeper dive suggests that the ideal granularity may in fact be task dependent. This study is not intended to improve on the state of the art, nor do we intend for results to be compared against any complete agentic frameworks. Rather, we present a proof of concept for investigating how fault localization may impact automatic code repair in repository-scale scenarios. We present preliminary findings to this end and encourage further research into this relationship between the two phases.


Epileptic Seizure Detection in Separate Frequency Bands Using Feature Analysis and Graph Convolutional Neural Network (GCN) from Electroencephalogram (EEG) Signals

2604.00163v1 by Ferdaus Anam Jibon, Fazlul Hasan Siddiqui, F. Deeba, Gahangir Hossain

Epileptic seizures are neurological disorders characterized by abnormal and excessive electrical activity in the brain, resulting in recurrent seizure events. Electroencephalogram (EEG) signals are widely used for seizure diagnosis due to their ability to capture temporal and spatial neural dynamics. While recent deep learning methods have achieved high detection accuracy, they often lack interpretability and neurophysiological relevance. This study presents a frequency-aware framework for epileptic seizure detection based on ictal-phase EEG analysis. The raw EEG signals are decomposed into five frequency bands (delta, theta, alpha, lower beta, and higher beta), and eleven discriminative features are extracted from each band. A graph convolutional neural network (GCN) is then employed to model spatial dependencies among EEG electrodes, represented as graph nodes. Experiments on the CHB-MIT scalp EEG dataset demonstrate high detection performance, achieving accuracies of 97.1%, 97.13%, 99.5%, 99.7%, and 51.4% across the respective frequency bands, with an overall broadband accuracy of 99.01%. The results highlight the strong discriminative capability of mid-frequency bands and reveal frequency-specific seizure patterns. The proposed approach improves interpretability and diagnostic precision compared to conventional broadband EEG-based methods.


From Domain Understanding to Design Readiness: a playbook for GenAI-supported learning in Software Engineering

2604.00120v1 by Rafal Wlodarski

Software engineering courses often require rapid upskilling in supporting knowledge areas such as domain understanding and modeling methods. We report an experience from a two-week milestone in a master's course where 29 students used a customized ChatGPT (GPT-3.5) tutor grounded in a curated course knowledge base to learn cryptocurrency-finance basics and Domain-Driven Design (DDD). We logged all interactions and evaluated a 34.5% random sample of prompt-answer pairs (60/~174) with a five-dimension rubric (accuracy, relevance, pedagogical value, cognitive load, supportiveness), and we collected pre/post self-efficacy. Responses were consistently accurate and relevant in this setting: accuracy averaged 98.9% with no factual errors and only 2/60 minor inaccuracies, and relevance averaged 92.2%. Pedagogical value was high (89.4%) with generally appropriate cognitive load (82.78%), but supportiveness was low (37.78%). Students reported large pre-post self-efficacy gains for genAI-assisted domain learning and DDD application. From these observations we distill seventeen concrete teaching practices spanning prompt/configuration and course/workflow design (e.g., setting expected granularity, constraining verbosity, curating guardrail examples, adding small credit with a simple quality rubric). Within this single-course context, results suggest that genAI-supported learning can complement instruction in domain understanding and modeling tasks, while leaving room to improve tone and follow-up structure.


ContextClaim: A Context-Driven Paradigm for Verifiable Claim Detection

2603.30025v1 by Yufeng Li, Rrubaa Panchendrarajan, Arkaitz Zubiaga

Verifiable claim detection asks whether a claim expresses a factual statement that can, in principle, be assessed against external evidence. As an early filtering stage in automated fact-checking, it plays an important role in reducing the burden on downstream verification components. However, existing approaches to claim detection, whether based on check-worthiness or verifiability, rely solely on the claim text itself. This is a notable limitation for verifiable claim detection in particular, where determining whether a claim is checkable may benefit from knowing what entities and events it refers to and whether relevant information exists to support verification. Inspired by the established role of evidence retrieval in later-stage claim verification, we propose Context-Driven Claim Detection (ContextClaim), a paradigm that advances retrieval to the detection stage. ContextClaim extracts entity mentions from the input claim, retrieves relevant information from Wikipedia as a structured knowledge source, and employs large language models to produce concise contextual summaries for downstream classification. We evaluate ContextClaim on two datasets covering different topics and text genres, the CheckThat! 2022 COVID-19 Twitter dataset and the PoliClaim political debate dataset, across encoder-only and decoder-only models under fine-tuning, zero-shot, and few-shot settings. Results show that context augmentation can improve verifiable claim detection, although its effectiveness varies across domains, model architectures, and learning settings. Through component analysis, human evaluation, and error analysis, we further examine when and why the retrieved context contributes to more reliable verifiability judgments.
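The pipeline described in the abstract (entity extraction, Wikipedia retrieval, LLM summarization, then classification) can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the function `augment_claim`, the `[SEP]` separator, and the three callables are hypothetical stand-ins, and the toy knowledge base replaces real Wikipedia retrieval and a real LLM so the sketch runs offline.

```python
from typing import Callable, List


def augment_claim(
    claim: str,
    extract_entities: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    summarize: Callable[[str, List[str]], str],
    sep: str = " [SEP] ",  # assumed separator; the abstract does not specify one
) -> str:
    """Return the claim followed by a short context summary, or the claim alone."""
    entities = extract_entities(claim)
    # Collect every passage retrieved for every entity mention.
    passages = [p for entity in entities for p in retrieve(entity)]
    if not passages:
        return claim  # nothing retrieved: fall back to the text-only input
    summary = summarize(claim, passages).strip()
    return f"{claim}{sep}{summary}" if summary else claim


# Toy stand-ins (hypothetical) so the sketch runs without network access.
toy_kb = {"WHO": ["The World Health Organization is a UN agency for public health."]}
toy_extract = lambda text: [name for name in toy_kb if name in text]
toy_retrieve = lambda entity: toy_kb.get(entity, [])
toy_summarize = lambda claim, passages: passages[0]

augmented = augment_claim(
    "WHO reported 500 new cases today.", toy_extract, toy_retrieve, toy_summarize
)
print(augmented)
```

The augmented string would then be passed to whichever classifier is being fine-tuned or prompted; claims with no retrievable entities fall back to the original text.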

Physiological and Semantic Patterns in Medical Teams Using an Intelligent Tutoring System

2603.29950v1 by Xiaoshan Huang, Conrad Borchers, Jiayi Zhang, Susanne P. Lajoie

Effective collaboration requires teams to manage complex cognitive and emotional states through Socially Shared Regulation of Learning (SSRL). Physiological synchrony (i.e., longitudinal alignment in physiological signals) can indicate these states, but is hard to interpret on its own. We investigate the physiological and conversational dynamics of four medical dyads diagnosing a virtual patient case using an intelligent tutoring system. Semantic shifts in dialogue were correlated with transient physiological synchrony peaks. We also coded utterance segments for SSRL and derived cosine similarity using sentence embeddings. The results showed that activating prior knowledge featured significantly lower semantic similarity than simpler task execution. High physiological synchrony was associated with lower semantic similarity, suggesting that such moments involve exploratory and varied language use. Qualitative analysis triangulated these synchrony peaks as "pivotal moments": successful teams synchronized during shared discovery, while unsuccessful teams peaked during shared uncertainty. This research advances human-centered AI by demonstrating how biological signals can be fused with dialogues to understand critical moments in problem solving.
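The semantic-similarity measure mentioned in the abstract is cosine similarity between sentence embeddings. A minimal sketch follows, assuming the embeddings are already available as plain lists of floats; the embedding model itself is not named in the abstract, and comparing consecutive segments (`adjacent_similarity`) is a hypothetical choice made here for illustration.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def adjacent_similarity(embeddings: List[Sequence[float]]) -> List[float]:
    """Similarity between each utterance segment and the one that follows it."""
    return [cosine_similarity(a, b) for a, b in zip(embeddings, embeddings[1:])]


# Toy 2-d "embeddings": the third segment shifts topic, so similarity drops.
print(adjacent_similarity([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
```

Under this reading, a drop in the adjacent-segment similarity marks a semantic shift in the dialogue.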

End-to-End Image Compression with Segmentation Guided Dual Coding for Wind Turbines

2603.29927v1 by Raül Pérez-Gonzalo, Andreas Espersen, Søren Forchhammer, Antonio Agudo

Transferring large volumes of high-resolution images during wind turbine inspections introduces a bottleneck in assessing and detecting severe defects. Efficient coding must preserve high fidelity in blade regions while aggressively compressing the background. In this work, we propose an end-to-end deep learning framework that jointly performs segmentation and dual-mode (lossy and lossless) compression. The segmentation module accurately identifies the blade region, after which our region-of-interest (ROI) compressor encodes it at superior quality compared to the rest of the image. Unlike conventional ROI schemes that merely allocate more bits to salient areas, our framework integrates: (i) a robust segmentation network (BU-Netv2+P) with a CRF-regularized loss for precise blade localization, (ii) a hyperprior-based autoencoder optimized for lossy compression, and (iii) an extended bits-back coder with hierarchical models for fully lossless blade reconstruction. Furthermore, our ROI framework removes the sequential dependency in bits-back coding by reusing background-coded bits, enabling parallelized and efficient dual-mode compression. To the best of our knowledge, this is the first fully integrated learning-based ROI codec combining segmentation, lossy, and lossless compression, ensuring that subsequent defect detection is not compromised. Experiments on a large-scale wind turbine dataset demonstrate superior compression performance and efficiency, offering a practical solution for automated inspections.

SNEAK: Evaluating Strategic Communication and Information Leakage in Large Language Models

2603.29846v1 by Adar Avsian, Larry Heck

Large language models (LLMs) are increasingly deployed in multi-agent settings where communication must balance informativeness and secrecy. In such settings, an agent may need to signal information to collaborators while preventing an adversary from inferring sensitive details. However, existing LLM benchmarks primarily evaluate capabilities such as reasoning, factual knowledge, or instruction following, and do not directly measure strategic communication under asymmetric information. We introduce SNEAK (Secret-aware Natural language Evaluation for Adversarial Knowledge), a benchmark for evaluating selective information sharing in language models. In SNEAK, a model is given a semantic category, a candidate set of words, and a secret word, and must generate a message that indicates knowledge of the secret without revealing it too clearly. We evaluate generated messages using two simulated agents with different information states: an ally, who knows the secret and must identify the intended message, and a chameleon, who does not know the secret and attempts to infer it from the message. This yields two complementary metrics: utility, measuring how well the message communicates to collaborators, and leakage, measuring how much information it reveals to an adversary. Using this framework, we analyze the trade-off between informativeness and secrecy in modern language models and show that strategic communication under asymmetric information remains a challenging capability for current systems. Notably, human participants outperform all evaluated models by a large margin, achieving up to four times higher scores.
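The abstract names the two metrics but does not give their formulas. One simple reading, assumed here purely for illustration, is that both are match rates over evaluation rounds: utility is how often the ally picks the intended message, and leakage is how often the chameleon guesses the secret word. The function names and data layout below are hypothetical.

```python
from typing import Hashable, Sequence


def match_rate(predictions: Sequence[Hashable], targets: Sequence[Hashable]) -> float:
    """Fraction of rounds in which the prediction equals the target."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


def utility(ally_picks: Sequence[int], intended: Sequence[int]) -> float:
    """Assumed definition: share of rounds where the ally identified the intended message."""
    return match_rate(ally_picks, intended)


def leakage(chameleon_guesses: Sequence[str], secrets: Sequence[str]) -> float:
    """Assumed definition: share of rounds where the chameleon inferred the secret word."""
    return match_rate(chameleon_guesses, secrets)


# Four toy rounds: the ally succeeds in three, the chameleon in one.
print(utility([0, 2, 1, 1], [0, 2, 1, 3]))
print(leakage(["apple", "pear", "kiwi", "plum"], ["apple", "fig", "lime", "date"]))
```

A message is strategically good when it pushes utility up while keeping leakage down, which is the trade-off the benchmark is built to expose.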

DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

2603.29844v1 by Yi Chen, Yuying Ge, Hui Zhou, Mingyu Ding, Yixiao Ge, Xihui Liu

The development of Vision-Language-Action (VLA) models has been significantly accelerated by pre-trained Vision-Language Models (VLMs). However, most existing end-to-end VLAs treat the VLM primarily as a multimodal encoder, directly mapping vision-language features to low-level actions. This paradigm underutilizes the VLM's potential in high-level decision making and introduces training instability, frequently degrading its rich semantic representations. To address these limitations, we introduce DIAL, a framework bridging high-level decision making and low-level motor execution through a differentiable latent intent bottleneck. Specifically, a VLM-based System-2 performs latent world modeling by synthesizing latent visual foresight within the VLM's native feature space; this foresight explicitly encodes intent and serves as the structural bottleneck. A lightweight System-1 policy then decodes this predicted intent together with the current observation into precise robot actions via latent inverse dynamics. To ensure optimization stability, we employ a two-stage training paradigm: a decoupled warmup phase where System-2 learns to predict latent futures while System-1 learns motor control under ground-truth future guidance within a unified feature space, followed by seamless end-to-end joint optimization. This enables action-aware gradients to refine the VLM backbone in a controlled manner, preserving pre-trained knowledge. Extensive experiments on the RoboCasa GR1 Tabletop benchmark show that DIAL establishes a new state-of-the-art, achieving superior performance with 10x fewer demonstrations than prior methods. Furthermore, by leveraging heterogeneous human demonstrations, DIAL learns physically grounded manipulation priors and exhibits robust zero-shot generalization to unseen objects and novel configurations during real-world deployment on a humanoid robot.

ENEIDE: A High Quality Silver Standard Dataset for Named Entity Recognition and Linking in Historical Italian

2603.29801v1 by Cristian Santini, Sebastian Barzaghi, Paolo Sernani, Emanuele Frontoni, Laura Melosi, Mehwish Alam

This paper introduces ENEIDE (Extracting Named Entities from Italian Digital Editions), a silver standard dataset for Named Entity Recognition and Linking (NERL) in historical Italian texts. The corpus comprises 2,111 documents with over 8,000 entity annotations semi-automatically extracted from two scholarly digital editions: Digital Zibaldone, the philosophical diary of the Italian poet Giacomo Leopardi (1798-1837), and Aldo Moro Digitale, the complete works of the Italian politician Aldo Moro (1916-1978). Annotations cover multiple entity types (person, location, organization, literary work) linked to Wikidata identifiers, including NIL entities that cannot be mapped to the knowledge graph. To the best of our knowledge, ENEIDE represents the first multi-domain, publicly available NERL dataset for historical Italian with training, development, and test splits. We present a methodology for semi-automatic annotation extraction from manually curated scholarly digital editions, including quality control and annotation enhancement procedures. Baseline experiments using state-of-the-art models show that the dataset is challenging for NERL and expose the gap between zero-shot approaches and fine-tuned models. The dataset's diachronic coverage spanning two centuries makes it particularly suitable for temporal entity disambiguation and cross-domain evaluation. ENEIDE is released under a CC BY-NC-SA 4.0 license.

Training-Free Dynamic Upcycling of Expert Language Models

2603.29765v1 by Eros Fanì, Oğuzhan Ersoy

Large Language Models (LLMs) have achieved remarkable performance on a wide range of specialized tasks, exhibiting strong problem-solving capabilities. However, training these models is prohibitively expensive, and they often lack domain-specific expertise because they rely on general knowledge datasets. Expertise finetuning can address this issue; however, it often leads to overspecialization, and developing a single multi-domain expert remains difficult due to diverging objectives. Furthermore, multitask training is challenging due to interference and catastrophic forgetting. Existing work proposes combining the expertise of dense models within a Mixture of Experts (MoE) architecture, although this approach still requires multitask finetuning. To address these issues, we introduce Dynamic Upcycling MoE (DUME), a novel approach that reuses dense experts trained on different domains to construct a unified MoE model. Our method builds a single multitask model that preserves the capabilities of the original dense experts without requiring additional training. DUME is both cost-efficient and scalable: by leveraging the closed-form solution of ridge regression, it eliminates the need for further optimization and enables experts to be added dynamically while maintaining the model's original performance. We demonstrate that DUME consistently outperforms baseline approaches in both causal language modeling and reasoning settings. Finally, we also show that the DUME model can be fine-tuned to further improve performance. We show that, in the causal language modeling setting, DUME can retain up to 97.6% of a dense expert model specialized in one particular domain, and that it can also surpass it in the reasoning setting, where it can achieve 102.1% of the dense expert performance. Our code is available at: github.com/gensyn-ai/dume.
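The abstract credits the training-free property to the closed-form solution of ridge regression, without saying which quantities are regressed. The sketch below shows only the generic closed form, W = (XᵀX + λI)⁻¹XᵀY, solved in pure Python. Treating `X` as hidden features and `Y` as one-hot domain labels (for example, to fit a router) is an assumption made for illustration, not a description of DUME itself.

```python
from typing import List

Matrix = List[List[float]]


def ridge_closed_form(X: Matrix, Y: Matrix, lam: float = 1e-3) -> Matrix:
    """Return W minimising ||XW - Y||^2 + lam * ||W||^2, with no gradient steps."""
    n, d, k = len(X), len(X[0]), len(Y[0])
    # A = X^T X + lam * I  (d x d); positive definite whenever lam > 0.
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    # B = X^T Y  (d x k)
    B = [[sum(X[r][i] * Y[r][c] for r in range(n)) for c in range(k)]
         for i in range(d)]
    # Solve A W = B by Gauss-Jordan elimination with partial pivoting.
    M = [A[i] + B[i] for i in range(d)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(d):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[d:] for row in M]


# Toy check: data generated by W_true = [[2], [-1]] is recovered for a tiny lam.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [[2.0], [-1.0], [1.0]]
W = ridge_closed_form(X, Y, lam=1e-9)
print(W)
```

Because the solution depends on the data only through the sums XᵀX and XᵀY, those sums can in principle be accumulated incrementally, which is one plausible reason a closed form suits adding components without re-optimizing; whether DUME does this is not stated in the abstract.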

KEditVis: A Visual Analytics System for Knowledge Editing of Large Language Models

2603.29689v1 by Zhenning Chen, Hanbei Zhan, Yanwei Huang, Xin Wu, Dazhen Deng, Di Weng, Yingcai Wu

Large Language Models (LLMs) demonstrate exceptional capabilities in factual question answering, yet they sometimes provide incorrect responses. To address this issue, knowledge editing techniques have emerged as effective methods for correcting factual information in LLMs. However, typical knowledge editing workflows struggle with identifying the optimal set of model layers for editing and rely on summary indicators that provide insufficient guidance. This lack of transparency hinders effective comparison and identification of optimal editing strategies. In this paper, we present KEditVis, a novel visual analytics system designed to assist users in gaining a deeper understanding of knowledge editing through interactive visualizations, improving editing outcomes, and discovering valuable insights for the future development of knowledge editing algorithms. With KEditVis, users can select appropriate layers as the editing target, explore the reasons behind ineffective edits, and perform more targeted and effective edits. Our evaluation, including usage scenarios, expert interviews, and a user study, validates the effectiveness and usability of the system.

Beyond the Steeper Curve: AI-Mediated Metacognitive Decoupling and the Limits of the Dunning-Kruger Metaphor

2603.29681v1 by Christopher Koch

The common claim that generative AI simply amplifies the Dunning-Kruger effect is too coarse to capture the available evidence. The clearest findings instead suggest that large language model (LLM) use can improve observable output and short-term task performance while degrading metacognitive accuracy and flattening the classic competence-confidence gradient across skill groups. This paper synthesizes evidence from human-AI interaction, learning research, and model evaluation, and proposes the working model of AI-mediated metacognitive decoupling: a widening gap among produced output, underlying understanding, calibration accuracy, and self-assessed ability. This four-variable account better explains overconfidence, over- and under-reliance, crutch effects, and weak transfer than the simpler metaphor of a uniformly steeper Dunning-Kruger curve. The paper concludes with implications for tool design, assessment, and knowledge work.

A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models

2603.29676v1 by Lixin Xiu, Xufang Luo, Hideki Nakayama

Large vision-language models (LVLMs) achieve impressive performance, yet their internal decision-making processes remain opaque, making it difficult to determine if the success stems from true multimodal fusion or from reliance on unimodal priors. To address this attribution gap, we introduce a novel framework using partial information decomposition (PID) to quantitatively measure the "information spectrum" of LVLMs by decomposing a model's decision-relevant information into redundant, unique, and synergistic components. By adapting a scalable estimator to modern LVLM outputs, our model-agnostic pipeline profiles 26 LVLMs on four datasets across three dimensions: breadth (cross-model & cross-task), depth (layer-wise information dynamics), and time (learning dynamics across training). Our analysis reveals two key results: (i) two task regimes (synergy-driven vs. knowledge-driven) and (ii) two stable, contrasting family-level strategies (fusion-centric vs. language-centric). We also uncover a consistent three-phase pattern in layer-wise processing and identify visual instruction tuning as the key stage where fusion is learned. Together, these contributions provide a quantitative lens beyond accuracy-only evaluation and offer insights for analyzing and designing the next generation of LVLMs. Code and data are available at https://github.com/RiiShin/pid-lvlm-analysis.
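For readers unfamiliar with PID, the redundant, unique, and synergistic components named in the abstract correspond to the standard two-source decomposition written below. The symbols are assumptions for illustration: $X_1$ and $X_2$ stand for the visual and textual inputs, $Y$ for the model's decision, $R$ for redundancy, $U_1$ and $U_2$ for the unique parts, and $S$ for synergy. The paper's specific estimator is not described in the abstract.

```latex
\begin{aligned}
I(X_1, X_2; Y) &= R + U_1 + U_2 + S, \\
I(X_1; Y)      &= R + U_1, \\
I(X_2; Y)      &= R + U_2.
\end{aligned}
```

Under this decomposition, a large $S$ indicates that the decision depends on combining both modalities, while a large $U_2$ would indicate reliance on the language input alone.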

ASI-Evolve: AI Accelerates AI

2603.29640v1 by Weixian Xu, Tiantian Mi, Yixiu Liu, Yang Nan, Zhimeng Zhou, Lyumanshan Ye, Lin Zhang, Yu Qiao, Pengfei Liu

Can AI accelerate the development of AI itself? While recent agentic systems have shown strong performance on well-scoped tasks with rapid feedback, it remains unclear whether they can tackle the costly, long-horizon, and weakly supervised research loops that drive real AI progress. We present ASI-Evolve, an agentic framework for AI-for-AI research that closes this loop through a learn-design-experiment-analyze cycle. ASI-Evolve augments standard evolutionary agents with two key components: a cognition base that injects accumulated human priors into each round of exploration, and a dedicated analyzer that distills complex experimental outcomes into reusable insights for future iterations. To our knowledge, ASI-Evolve is the first unified framework to demonstrate AI-driven discovery across three central components of AI development: data, architectures, and learning algorithms. In neural architecture design, it discovered 105 SOTA linear attention architectures, with the best discovered model surpassing DeltaNet by +0.97 points, nearly 3x the gain of recent human-designed improvements. In pretraining data curation, the evolved pipeline improves average benchmark performance by +3.96 points, with gains exceeding 18 points on MMLU. In reinforcement learning algorithm design, discovered algorithms outperform GRPO by up to +12.5 points on AMC32, +11.67 points on AIME24, and +5.04 points on OlympiadBench. We further provide initial evidence that this AI-for-AI paradigm can transfer beyond the AI stack through experiments in mathematics and biomedicine. Together, these results suggest that ASI-Evolve represents a promising step toward enabling AI to accelerate AI across the foundational stages of development, offering early evidence for the feasibility of closed-loop AI research.

FlowPIE: Test-Time Scientific Idea Evolution with Flow-Guided Literature Exploration

2603.29557v1 by Qiyao Wang, Hongbo Wang, Longze Chen, Zhihao Yang, Guhong Chen, Hamid Alinejad-Rokny, Hui Li, Yuan Lin, Min Yang

Scientific idea generation (SIG) is critical to AI-driven autonomous research, yet existing approaches are often constrained by a static retrieval-then-generation paradigm, leading to homogeneous and insufficiently divergent ideas. In this work, we propose FlowPIE, a tightly coupled retrieval-generation framework that treats literature exploration and idea generation as a co-evolving process. FlowPIE expands literature trajectories via a flow-guided Monte Carlo Tree Search (MCTS) inspired by GFlowNets, using the quality of current ideas assessed by an LLM-based generative reward model (GRM) as a supervised signal to guide adaptive retrieval and construct a diverse, high-quality initial population. Based on this population, FlowPIE models idea generation as a test-time idea evolution process, applying selection, crossover, and mutation with the isolation island paradigm and GRM-based fitness computation to incorporate cross-domain knowledge. It effectively mitigates the information cocoons arising from over-reliance on parametric knowledge and static literature. Extensive evaluations demonstrate that FlowPIE consistently produces ideas with higher novelty, feasibility and diversity compared to strong LLM-based and agent-based frameworks, while enabling reward scaling during test time.
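The evolutionary stage described in the abstract (selection, crossover, and mutation under an isolation-island paradigm) can be illustrated with a generic island-model loop. This is a sketch under stated assumptions, not FlowPIE's algorithm: in the paper the individuals are research ideas, fitness comes from the GRM, and the operators are presumably LLM-driven, whereas here every callable is a hypothetical placeholder and the toy run uses bit tuples.

```python
import random
from typing import Callable, List, Optional, Sequence


def evolve_islands(islands: List[list], fitness: Callable, crossover: Callable,
                   mutate: Callable, generations: int = 5, migrate_every: int = 2,
                   rng: Optional[random.Random] = None):
    """Evolve separate populations and periodically migrate each island's best individual."""
    if rng is None:
        rng = random.Random(0)
    for gen in range(1, generations + 1):
        for i, pop in enumerate(islands):
            ranked = sorted(pop, key=fitness, reverse=True)
            parents = ranked[: max(2, len(ranked) // 2)]  # selection: keep the fitter half
            children = []
            while len(parents) + len(children) < len(pop):
                a, b = rng.sample(parents, 2)
                children.append(mutate(crossover(a, b, rng), rng))
            islands[i] = parents + children
        if gen % migrate_every == 0 and len(islands) > 1:
            # Migration: each island's best replaces the worst member of the next island.
            bests = [max(pop, key=fitness) for pop in islands]
            for i, best in enumerate(bests):
                target = islands[(i + 1) % len(islands)]
                worst = min(range(len(target)), key=lambda j: fitness(target[j]))
                target[worst] = best
    return max((ind for pop in islands for ind in pop), key=fitness)


# Toy operators on bit tuples; fitness is simply the number of ones.
def one_point_crossover(a: Sequence[int], b: Sequence[int], rng: random.Random):
    cut = rng.randrange(1, len(a))
    return tuple(a[:cut]) + tuple(b[cut:])


def flip_one_bit(x: Sequence[int], rng: random.Random):
    i = rng.randrange(len(x))
    return tuple(x[:i]) + (1 - x[i],) + tuple(x[i + 1:])


seed_rng = random.Random(1)
toy_islands = [[tuple(seed_rng.randint(0, 1) for _ in range(8)) for _ in range(4)]
               for _ in range(2)]
best = evolve_islands(toy_islands, sum, one_point_crossover, flip_one_bit,
                      generations=6, rng=random.Random(2))
print(best, sum(best))
```

Keeping islands isolated between migrations preserves diversity across sub-populations, which matches the abstract's stated goal of avoiding homogeneous ideas.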

Bringing Up a Bilingual BabyLM: Investigating Multilingual Language Acquisition Using Small-Scale Models

2603.29552v1 by Linda Zeng, Steven Y. Feng, Michael C. Frank

Multilingualism is incredibly common around the world, leading to many important theoretical and practical questions about how children learn multiple languages at once. For example, does multilingual acquisition lead to delays in learning? Are there better and worse ways to structure multilingual input? Many correlational studies address these questions, but it is surprisingly difficult to get definitive answers because children cannot be randomly assigned to be multilingual and data are typically not matched between languages. We use language model training as a method for simulating a variety of highly controlled exposure conditions, and create matched 100M-word mono- and bilingual datasets using synthetic data and machine translation. We train GPT-2 models on monolingual and bilingual data organized to reflect a range of exposure regimes, and evaluate their performance on perplexity, grammaticality, and semantic knowledge. Across model scales and measures, bilingual models perform similarly to monolingual models in one language, but show strong performance in the second language as well. These results suggest that there are no strong differences between different bilingual exposure regimes, and that bilingual input poses no in-principle challenges for agnostic statistical learners.

Quantization with Unified Adaptive Distillation to enable multi-LoRA based one-for-all Generative Vision Models on edge

2603.29535v1 by Sowmya Vajrala, Aakash Parmar, Prasanna R, Sravanth Kodavanti, Manjunath Arveti, Srinivas Soumitri Miriyala, Ashok Senapati

Generative Artificial Intelligence (GenAI) features such as image editing, object removal, and prompt-guided image transformation are increasingly integrated into mobile applications. However, deploying Large Vision Models (LVMs) for such tasks on resource-constrained devices remains challenging due to their high memory and compute requirements. While Low-Rank Adapters (LoRAs) enable parameter-efficient task adaptation, existing mobile deployment pipelines typically compile a separate model binary for each LoRA, each bundled with its own copy of the foundation model, resulting in redundant storage and increased runtime overhead. In this work, we present a unified framework for enabling multi-task GenAI inference on edge devices using a single shared model. Our key idea is to treat LoRA weights as runtime inputs rather than embedding them into the compiled model graph, allowing dynamic task switching at runtime without recompilation. Then, to support efficient on-device execution, we introduce QUAD (Quantization with Unified Adaptive Distillation), a quantization-aware training strategy that aligns multiple LoRA adapters under a shared quantization profile. We implement the proposed system with a lightweight runtime stack compatible with mobile NPUs and evaluate it across multiple chipsets. Experimental results demonstrate up to a 6x reduction in memory footprint and up to a 4x improvement in latency, while maintaining high visual quality across multiple GenAI tasks.
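The key idea of passing LoRA weights as runtime inputs, instead of baking them into the compiled graph, can be illustrated with the usual LoRA forward pass y = Wx + s·B(Ax). The sketch below is a plain-Python illustration under assumptions: the function names, the `scale` parameter, and the adapter registry are hypothetical, and nothing here reflects the paper's NPU runtime or quantization scheme.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(M: Matrix, x: Vector) -> Vector:
    """Multiply matrix M by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(W: Matrix, x: Vector, A: Matrix, B: Matrix, scale: float = 1.0) -> Vector:
    """Shared base layer plus a low-rank update supplied at call time: Wx + scale * B(Ax)."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))  # A: r x d_in, B: d_out x r
    return [b + scale * d for b, d in zip(base, delta)]


# One shared base weight; adapters are just data selected per request.
W_shared: Matrix = [[1.0, 0.0], [0.0, 1.0]]
adapters: Dict[str, Tuple[Matrix, Matrix]] = {
    "edit": ([[1.0, 0.0]], [[0.0], [1.0]]),    # rank-1 adapter for a hypothetical task
    "remove": ([[0.0, 0.0]], [[0.0], [0.0]]),  # zero adapter: base behaviour only
}

x = [1.0, 2.0]
for task, (A, B) in adapters.items():
    print(task, lora_forward(W_shared, x, A, B))
```

Because `W_shared` never changes and each adapter is passed in as an argument, switching tasks means switching inputs, not recompiling or duplicating the base model.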

Baby Scale: Investigating Models Trained on Individual Children's Language Input

2603.29522v1 by Steven Y. Feng, Alvin W. M. Tan, Michael C. Frank

Modern language models (LMs) must be trained on many orders of magnitude more words of training data than human children receive before they begin to produce useful behavior. Assessing the nature and origins of this "data gap" requires benchmarking LMs on human-scale datasets to understand how linguistic knowledge emerges from children's natural training data. Using transcripts from the BabyView dataset (videos from children ages 6-36 months), we investigate (1) scaling performance at child-scale data regimes, (2) variability in model performance across datasets from different children's experiences and linguistic predictors of dataset quality, and (3) relationships between model and child language learning outcomes. LMs trained on child data show acceptable scaling for grammar tasks, but lower scaling on semantic and world knowledge tasks than models trained on synthetic data; we also observe substantial variability on data from different children. Beyond dataset size, performance is most associated with a combination of distributional and interactional linguistic features, broadly consistent with what makes high-quality input for child language development. Finally, model likelihoods for individual words correlate with children's learning of those words, suggesting that properties of child-directed input may influence both model learning and human language development. Overall, understanding what properties make language data efficient for learning can enable more powerful small-scale language models while also shedding light on human language acquisition.

Impact of enriched meaning representations for language generation in dialogue tasks: A comprehensive exploration of the relevance of tasks, corpora and metrics

2603.29518v1 by Alain Vázquez, Maria Inés Torres

Conversational systems should generate diverse language forms to interact fluently and accurately with users. In this context, Natural Language Generation (NLG) engines convert Meaning Representations (MRs) into sentences, directly influencing user perception. These MRs usually encode the communicative function (e.g., inform, request, confirm) via DAs and enumerate the semantic content with slot-value pairs. In this work, our objective is to analyse whether providing a task demonstrator to the generator enhances the generations of a fine-tuned model. This demonstrator is an MR-sentence pair extracted from the original dataset that enriches the input at training and inference time. The analysis involves five metrics that focus on different linguistic aspects, and four datasets that differ in multiple features, such as domain, size, lexicon, MR variability, and acquisition process. To the best of our knowledge, this is the first study on dialogue NLG implementing a comparative analysis of the impact of MRs on generation quality across domains, corpus characteristics, and the metrics used to evaluate these generations. Our key insight is that the proposed enriched inputs are effective for complex tasks and small datasets with high variability in MRs and sentences. They are also beneficial in zero-shot settings for any domain. Moreover, the analysis of the metrics shows that semantic metrics capture generation quality more accurately than lexical metrics. In addition, among these semantic metrics, those trained with human ratings can detect omissions and other subtle semantic issues that embedding-based metrics often miss. Finally, the evolution of the metric scores and the excellent results for Slot Accuracy and Dialogue Act Accuracy demonstrate that the generative models present fast adaptability to different tasks and robustness at semantic and communicative intention levels.
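
The enriched input described above can be pictured with a small sketch. The linearization format, the prompt layout, and the names `linearize_mr` and `build_input` are assumptions made for illustration; the abstract does not say how an MR and its demonstrator are serialized, and the restaurant data is invented.

```python
def linearize_mr(dialogue_act, slots):
    """Flatten an MR (dialogue act plus slot-value pairs) into one string."""
    pairs = ", ".join(f"{slot}={value}" for slot, value in slots.items())
    return f"{dialogue_act}({pairs})"


def build_input(target_mr, demo_mr, demo_sentence):
    """Prepend one MR-sentence demonstrator to the MR that should be verbalized."""
    return (
        f"example MR: {demo_mr}\n"
        f"example sentence: {demo_sentence}\n"
        f"MR: {target_mr}\n"
        f"sentence:"
    )


# Hypothetical demonstrator and target, both in an invented restaurant domain
demo = linearize_mr("inform", {"name": "Blue Door", "food": "Italian"})
target = linearize_mr("inform", {"name": "Red Lantern", "food": "Thai", "area": "centre"})
prompt = build_input(target, demo, "Blue Door serves Italian food.")
print(prompt)
```

The same `build_input` call would be used when preparing training examples and at inference, which matches the abstract's statement that the demonstrator enriches the input at both stages.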

Structural Compactness as a Complementary Criterion for Explanation Quality

2603.29491v1 by Mohammad Mahdi Mesgari, Jackie Ma, Wojciech Samek, Sebastian Lapuschkin, Leander Weber

In the evaluation of attribution quality, the quantitative assessment of explanation legibility is particularly difficult, as it is influenced by varying shapes and internal organization of attributions not captured by simple statistics. To address this issue, we introduce Minimum Spanning Tree Compactness (MST-C), a graph-based structural metric that captures higher-order geometric properties of attributions, such as spread and cohesion. These components are combined into a single score that evaluates compactness, favoring attributions with salient points spread across a small area and spatially organized into few but cohesive clusters. We show that MST-C reliably distinguishes between explanation methods, exposes fundamental structural differences between models, and provides a robust, self-contained diagnostic for explanation compactness that complements existing notions of attribution complexity.
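
As a rough illustration of the graph-based idea, the sketch below computes the total edge length of a Euclidean minimum spanning tree over the most salient pixels of an attribution map. It is not the MST-C score itself: the abstract does not define how spread and cohesion are combined, so the top-k selection, the use of Prim's algorithm, and the names `salient_points` and `mst_length` are all assumptions.

```python
import math


def salient_points(attribution, k):
    """Return (row, col) coordinates of the k largest values in a 2D list."""
    cells = [(value, r, c)
             for r, row in enumerate(attribution)
             for c, value in enumerate(row)]
    cells.sort(key=lambda cell: cell[0], reverse=True)
    return [(r, c) for _, r, c in cells[:k]]


def mst_length(points):
    """Total edge length of the Euclidean MST over points (Prim's algorithm, O(n^2))."""
    if len(points) < 2:
        return 0.0
    # best[i] holds the shortest distance from point i to the tree built so far
    best = [math.dist(points[0], p) for p in points]
    in_tree = [False] * len(points)
    in_tree[0] = True
    total = 0.0
    for _ in range(len(points) - 1):
        nxt = min((i for i in range(len(points)) if not in_tree[i]),
                  key=lambda i: best[i])
        total += best[nxt]
        in_tree[nxt] = True
        for i, p in enumerate(points):
            if not in_tree[i]:
                best[i] = min(best[i], math.dist(points[nxt], p))
    return total


def make_map(size, hot_cells):
    """Build a size x size map that is 1.0 at hot_cells and 0.0 elsewhere."""
    return [[1.0 if (r, c) in hot_cells else 0.0 for c in range(size)]
            for r in range(size)]


compact = make_map(5, {(2, 2), (2, 3), (3, 2), (3, 3)})     # one tight cluster
scattered = make_map(5, {(0, 0), (0, 4), (4, 0), (4, 4)})   # four far-apart corners
print(mst_length(salient_points(compact, 4)))    # 3.0
print(mst_length(salient_points(scattered, 4)))  # 12.0
```

Under these assumptions a shorter tree indicates salient points that sit close together, which is the intuition behind favoring attributions concentrated in a small area.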

iPoster: Content-Aware Layout Generation for Interactive Poster Design via Graph-Enhanced Diffusion Models

2603.29469v1 by Xudong Zhou, Jinyuan Liang, Qiuyi Guo, Guozheng Li

We present iPoster, an interactive layout generation framework that empowers users to guide content-aware poster layout design by specifying flexible constraints. iPoster enables users to specify partial intentions within the intention module, such as element categories, sizes, positions, or coarse initial drafts. Then, the generation module instantly generates refined, context-sensitive layouts that faithfully respect these constraints. iPoster employs a unified graph-enhanced diffusion architecture that supports various design tasks under user-specified constraints. These constraints are enforced through masking strategies that precisely preserve user input at every denoising step. A cross content-aware attention module aligns generated elements with salient regions of the canvas, ensuring visual coherence. Extensive experiments show that iPoster not only achieves state-of-the-art layout quality, but offers a responsive and controllable framework for poster layout design with constraints.
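
The masking idea can be sketched in a few lines: at every denoising step, the entries the user fixed are written back into the sample so the denoiser can only change the free entries. This is a toy under stated assumptions, not iPoster's architecture; `toy_denoise_step` is a hypothetical stand-in for the learned graph-enhanced denoiser, and the flat list of normalized values is an invented layout encoding.

```python
import random


def toy_denoise_step(layout, step, total_steps):
    """Hypothetical stand-in for a learned denoiser: pulls each value toward 0.5."""
    rate = 1.0 / (total_steps - step)
    return [x + rate * (0.5 - x) for x in layout]


def clamp(layout, user_values, mask):
    """Overwrite masked entries with the values the user specified."""
    return [u if m else x for x, u, m in zip(layout, user_values, mask)]


def generate_with_constraints(user_values, mask, total_steps=10, seed=0):
    """Denoise from random noise while re-imposing user constraints at every step.

    user_values: floats such as normalized positions or sizes of layout elements
    mask: booleans, True where the user fixed the corresponding value
    """
    rng = random.Random(seed)
    layout = [rng.random() for _ in user_values]
    for step in range(total_steps):
        # Clamp before denoising so the fixed entries act as context for the step
        layout = clamp(layout, user_values, mask)
        layout = toy_denoise_step(layout, step, total_steps)
    # Final clamp so fixed entries are returned exactly as specified
    return clamp(layout, user_values, mask)


user_values = [0.1, 0.0, 0.0, 0.9]
mask = [True, False, False, True]
layout = generate_with_constraints(user_values, mask)
print(layout)
```

The first and last entries come back exactly as the user set them, while the free entries are whatever the denoiser produced.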

PRISM: PRIor from corpus Statistics for topic Modeling

2603.29406v1 by Tal Ishon, Yoav Goldberg, Uri Shaham

Topic modeling seeks to uncover latent semantic structure in text, with LDA providing a foundational probabilistic framework. While recent methods often incorporate external knowledge (e.g., pre-trained embeddings), such reliance limits applicability in emerging or underexplored domains. We introduce PRISM, a corpus-intrinsic method that derives a Dirichlet parameter from word co-occurrence statistics to initialize LDA without altering its generative process. Experiments on text and single cell RNA-seq data show that PRISM improves topic coherence and interpretability, rivaling models that rely on external knowledge. These results underscore the value of corpus-driven initialization for topic modeling in resource-constrained settings. Code is available at: https://github.com/shaham-lab/PRISM.

Security in LLM-as-a-Judge: A Comprehensive SoK

2603.29403v1 by Aiman Almasoud, Antony Anju, Marco Arazzi, Mert Cihangiroglu, Vignesh Kumar Kembu, Serena Nicolazzo, Antonino Nocera, Vinod P., Saraga Sakthidharan

LLM-as-a-Judge (LaaJ) is a novel paradigm in which powerful language models are used to assess the quality, safety, or correctness of generated outputs. While this paradigm has significantly improved the scalability and efficiency of evaluation processes, it also introduces novel security risks and reliability concerns that remain largely unexplored. In particular, LLM-based judges can become both targets of adversarial manipulation and instruments through which attacks are conducted, potentially compromising the trustworthiness of evaluation pipelines. In this paper, we present the first Systematization of Knowledge (SoK) focusing on the security aspects of LLM-as-a-Judge systems. We perform a comprehensive literature review across major academic databases, analyzing 863 works and selecting 45 relevant studies published between 2020 and 2026. Based on this study, we propose a taxonomy that organizes recent research according to the role played by LLM-as-a-Judge in the security landscape, distinguishing between attacks targeting LaaJ systems, attacks performed through LaaJ, defenses leveraging LaaJ for security purposes, and applications where LaaJ is used as an evaluation strategy in security-related domains. We further provide a comparative analysis of existing approaches, highlighting current limitations, emerging threats, and open research challenges. Our findings reveal significant vulnerabilities in LLM-based evaluation frameworks, as well as promising directions for improving their robustness and reliability. Finally, we outline key research opportunities that can guide the development of more secure and trustworthy LLM-as-a-Judge systems.

Beyond Idealized Patients: Evaluating LLMs under Challenging Patient Behaviors in Medical Consultations

2603.29373v1 by Yahan Li, Xinyi Jie, Wanjia Ruan, Xubei Zhang, Huaijie Zhu, Yicheng Gao, Chaohao Du, Ruishan Liu

Large language models (LLMs) are increasingly used for medical consultation and health information support. In this high-stakes setting, safety depends not only on medical knowledge, but also on how models respond when patient inputs are unclear, inconsistent, or misleading. However, most existing medical LLM evaluations assume idealized and well-posed patient questions, which limits their realism. In this paper, we study challenging patient behaviors that commonly arise in real medical consultations and complicate safe clinical reasoning. We define four clinically grounded categories of such behaviors: information contradiction, factual inaccuracy, self-diagnosis, and care resistance. For each behavior, we specify concrete failure criteria that capture unsafe responses. Building on four existing medical dialogue datasets, we introduce CPB-Bench (Challenging Patient Behaviors Benchmark), a bilingual (English and Chinese) benchmark of 692 multi-turn dialogues annotated with these behaviors. We evaluate a range of open- and closed-source LLMs on their responses to challenging patient utterances. While models perform well overall, we identify consistent, behavior-specific failure patterns, with particular difficulty in handling contradictory or medically implausible patient information. We also study four intervention strategies and find that they yield inconsistent improvements and can introduce unnecessary corrections. We release the dataset and code.

L-ReLF: A Framework for Lexical Dataset Creation

2603.29346v1 by Anass Sedrati, Mounir Afifi, Reda Benkhadra

This paper introduces the L-ReLF (Low-Resource Lexical Framework), a novel, reproducible methodology for creating high-quality, structured lexical datasets for underserved languages. The lack of standardized terminology, exemplified by Moroccan Darija, poses a critical barrier to knowledge equity in platforms like Wikipedia, often forcing editors to rely on inconsistent, ad-hoc methods to create new words in their language. Our research details the technical pipeline developed to overcome these challenges. We systematically address the difficulties of working with low-resource data, including source identification, utilizing Optical Character Recognition (OCR) despite its bias towards Modern Standard Arabic, and rigorous post-processing to correct errors and standardize the data model. The resulting structured dataset is fully compatible with Wikidata Lexemes, serving as a vital technical resource. The L-ReLF methodology is designed for generalizability, offering other language communities a clear path to build foundational lexical data for downstream NLP applications, such as Machine Translation and morphological analysis.

CADEL: A Corpus of Administrative Web Documents for Japanese Entity Linking

2603.29336v1 by Shohei Higashiyama, Masao Ideuchi, Masao Utiyama

Entity linking is the task of associating linguistic expressions with entries in a knowledge base that represent real-world entities and concepts. Language resources for this task have primarily been developed for English, and the resources available for evaluating Japanese systems remain limited. In this study, we develop a corpus design policy for the entity linking task and construct an annotated corpus for training and evaluating Japanese entity linking systems, with rich coverage of linguistic expressions referring to entities that are specific to Japan. Evaluation of inter-annotator agreement confirms the high consistency of the annotations in the corpus, and a preliminary experiment on entity disambiguation based on string matching suggests that the corpus contains a substantial number of non-trivial cases, supporting its potential usefulness as an evaluation benchmark.
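
A string-matching baseline of the kind mentioned in the abstract might look like the following. The alias table, the entity IDs, and the function name are hypothetical, and the paper's actual matching procedure is not described in the abstract; the sketch only shows why exact matching leaves non-trivial cases unresolved.

```python
def link_by_exact_match(mention, alias_table):
    """Return the entity ID if the mention matches exactly one entry, else None."""
    candidates = alias_table.get(mention, set())
    if len(candidates) == 1:
        return next(iter(candidates))
    return None  # unknown surface form, or several entities share this name


# Hypothetical alias table: surface form -> set of made-up entity IDs
alias_table = {
    "Mount Fuji": {"E1"},
    "Chuo Ward": {"E2", "E3"},  # the same ward name is used in more than one city
}

for mention in ["Mount Fuji", "Chuo Ward", "Fuji-san"]:
    print(mention, "->", link_by_exact_match(mention, alias_table))
```

Ambiguous names and unlisted variants both fall through to `None`, and these are the cases that need context beyond the surface string.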

PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models

2603.29281v1 by Amirreza Rouhi, Parikshit Sakurikar, Satya Sai Reddy, Narsimha Menga, Anirudh Govil, Sri Harsha Chittajallu, Rajat Aggarwal, Anoop Namboodiri, Sashi Reddi

A critical gap exists between the general-purpose visual understanding of state-of-the-art physical AI models and the specialized perceptual demands of structured real-world deployment environments. We present PRISM, a 270K-sample multi-view video supervised fine-tuning (SFT) corpus for embodied vision-language models (VLMs) in real-world retail environments. PRISM is motivated by a simple observation: physical AI systems fail not because of poor visual recognition, but because they do not understand space, physical dynamics, and embodied action well enough to operate reliably in the world. To this end, PRISM is grounded in a novel three-dimensional knowledge ontology that spans spatial knowledge, temporal and physical knowledge, and embodied action knowledge. It covers 20+ capability probes across four evaluation dimensions: Embodied Reasoning (ER), Common Sense (CS), Spatial Perception (SP), and Intuitive Physics (IP). To our knowledge, PRISM is the first dataset to instantiate all three knowledge dimensions within a single real-world deployment domain. The corpus captures data from egocentric, exocentric, and 360° viewpoints across five supermarket locations and includes open-ended, chain-of-thought, and multiple-choice supervision. At 4 fps, PRISM spans approximately 11.8M video frames and approximately 730M tokens, placing it among the largest domain-specific video SFT corpora. Fine-tuning on PRISM reduces the error rate across all 20+ probes by 66.6% over the pre-trained baseline, with significant gains in embodied action understanding, where accuracy improves by 36.4%. Our results suggest that ontology-structured, domain-specific SFT can meaningfully strengthen embodied VLMs for real-world settings. The PRISM dataset and more details are available at https://dreamvu.ai/prism
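
The scale figures in the abstract can be put into more familiar units with a little arithmetic. The sketch assumes that the frame count refers to frames sampled at 4 fps and summed over all views, and it reads the 66.6% figure as a relative reduction; the 30% baseline error in the last line is a made-up value for illustration only, not a number from the paper.

```python
frames = 11.8e6   # approximate frame count stated in the abstract
fps = 4           # sampling rate stated in the abstract
tokens = 730e6    # approximate token count stated in the abstract

hours = frames / fps / 3600
tokens_per_frame = tokens / frames
print(f"{hours:.0f} hours of sampled video")       # 819 hours of sampled video
print(f"{tokens_per_frame:.1f} tokens per frame")  # 61.9 tokens per frame


def error_after_relative_reduction(baseline_error, reduction=0.666):
    """Error rate left after a relative reduction (assumed reading of '66.6%')."""
    return baseline_error * (1 - reduction)


# Hypothetical baseline error of 30%, chosen only to show the calculation
print(f"{error_after_relative_reduction(0.30):.4f}")  # 0.1002
```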

Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs

2603.29232v1 by Zhuowen Liang, Xiaotian Lin, Zhengxuan Zhang, Yuyu Luo, Haixun Wang, Nan Tang

Large language models (LLMs) are widely applied to data analytics over documents, yet direct reasoning over long, noisy documents remains brittle and error-prone. Hence, we study document question answering (QA) that consolidates dispersed evidence into a structured output (e.g., a table, graph, or chunks) to support reliable, verifiable QA. We propose a two-pillar framework, LiteCoST, to achieve both high accuracy and low latency with small language models (SLMs). Pillar 1: Chain-of-Structured-Thought (CoST). We introduce a CoST template, a schema-aware instruction that guides a strong LLM to produce both a step-wise CoST trace and the corresponding structured output. The process induces a minimal structure, normalizes entities/units, aligns records, serializes the output, and verifies/refines it, yielding auditable supervision. Pillar 2: SLM fine-tuning. The compact models are trained on LLM-generated CoST data in two stages: Supervised Fine-Tuning for structural alignment, followed by Group Relative Policy Optimization (GRPO) incorporating triple rewards for answer/format quality and process consistency. By distilling structure-first behavior into SLMs, this approach achieves LLM-comparable quality on multi-domain long-document QA using 3B/7B SLMs, while delivering 2-4x lower latency than GPT-4o and DeepSeek-R1 (671B). The code is available at https://github.com/HKUSTDial/LiteCoST.
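
The group-relative step in GRPO can be sketched as follows: several outputs are sampled for the same prompt, each receives a scalar reward, and each reward is standardized against the group's mean and standard deviation. This is a minimal sketch, not LiteCoST's training code; the equal weighting of the three reward terms, the example scores, and the function names are assumptions, since the abstract names the rewards but not how they are combined.

```python
from statistics import mean, pstdev


def combined_reward(answer, fmt, process, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of three reward terms; equal weights are an assumption."""
    w_answer, w_format, w_process = weights
    return w_answer * answer + w_format * fmt + w_process * process


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of outputs sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this detail
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical outputs for one prompt, scored on (answer, format, process)
scores = [(1.0, 1.0, 0.5), (0.0, 1.0, 0.5), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
rewards = [combined_reward(*s) for s in scores]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Outputs that score above the group average get a positive advantage and those below get a negative one, so no separate value network is needed to provide a baseline.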

Software Vulnerability Detection Using a Lightweight Graph Neural Network

2603.29216v1 by Miles Farmer, Ekincan Ufuktepe, Anne Watson, Hialo Muniz Carvalho, Vadim Okun, Zineb Maasaoui, Kannappan Palaniappan

Large Language Models (LLMs) have emerged as a popular choice in vulnerability detection studies given their foundational capabilities, open source availability, and variety of models, but have limited scalability due to extensive compute requirements. Using the natural graph relational structure of code, we show that our proposed graph neural network (GNN) based deep learning model VulGNN for vulnerability detection can achieve performance almost on par with LLMs, but is 100 times smaller in size and fast to retrain and customize. We describe the VulGNN architecture, ablation studies on components, learning rates, and generalizability to different code datasets. As a lightweight model for vulnerability analysis, VulGNN is efficient and deployable at the edge as part of real-world software development pipelines.

Towards Automatic Soccer Commentary Generation with Knowledge-Enhanced Visual Reasoning

2604.00057v1 by Zeyu Jin, Xiaoyu Qin, Songtao Zhou, Kaifeng Yun, Jia Jia

Soccer commentary plays a crucial role in enhancing the soccer game viewing experience for audiences. Previous studies in automatic soccer commentary generation typically adopt an end-to-end method to generate anonymous live text commentary. Such generated commentary is insufficient in the context of real-world live televised commentary, as it contains anonymous entities, context-dependent errors and lacks statistical insights of the game events. To bridge the gap, we propose GameSight, a two-stage model to address soccer commentary generation as a knowledge-enhanced visual reasoning task, enabling live-televised-like knowledgeable commentary with accurate reference to entities (players and teams). GameSight starts by performing visual reasoning to align anonymous entities with fine-grained visual and contextual analysis. Subsequently, the entity-aligned commentary is refined with knowledge by incorporating external historical statistics and iteratively updated internal game state information. Consequently, GameSight improves the player alignment accuracy by 18.5% on SN-Caption-test-align dataset compared to Gemini 2.5-pro. Combined with further knowledge enhancement, GameSight outperforms in segment-level accuracy and commentary quality, as well as game-level contextual relevance and structural composition. We believe that our work paves the way for a more informative and engaging human-centric experience with the AI sports application. Demo Page: https://gamesight2025.github.io/gamesight2025

Knowledge database development by large language models for countermeasures against viruses and marine toxins

2603.29149v1 by Hung N. Do, Jessica Z. Kubicek-Sutherland, S. Gnanakaran

Access to the most up-to-date information on medical countermeasures is important for the research and development of effective treatments for viruses and marine toxins. However, there is a lack of comprehensive databases that curate data on viruses and marine toxins, making decisions on medical countermeasures slow and difficult. In this work, we employ two large language models (LLMs), ChatGPT and Grok, to design two comprehensive databases of therapeutic countermeasures for five viruses (Lassa, Marburg, Ebola, Nipah, and Venezuelan equine encephalitis) as well as marine toxins. With high-level human-provided inputs, the two LLMs identify public databases containing data on the five viruses and marine toxins, collect relevant information from these databases and the literature, iteratively cross-validate the collected information, and design interactive webpages for easy access to the curated, comprehensive databases. Notably, ChatGPT is employed to design agentic AI workflows (consisting of two AI agents for research and decision-making) to rank countermeasures for viruses and marine toxins in the databases. Together, our work explores the potential of LLMs as a scalable, updatable approach for building comprehensive knowledge databases and supporting evidence-based decision-making.

LLM

Publish Date Title Authors Homepage Code
2026-04-02 ActionParty: Multi-Subject Action Binding in Generative Video Games Alexander Pondaven et.al. 2604.02330v1 null
2026-04-02 Steerable Visual Representations Jona Ruthardt et.al. 2604.02327v1 null
2026-04-02 Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation Daiwei Chen et.al. 2604.02324v1 null
2026-04-02 Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning Bangji Yang et.al. 2604.02322v1 null
2026-04-02 No Single Best Model for Diversity: Learning a Router for Sample Diversity Yuhan Liu et.al. 2604.02319v1 null
2026-04-02 Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models Sarath Shekkizhar et.al. 2604.02315v1 null
2026-04-02 go-$m$HC: Direct Parameterization of Manifold-Constrained Hyper-Connections via Generalized Orthostochastic Matrices Torque Dandachi et.al. 2604.02309v1 null
2026-04-02 VOID: Video Object and Interaction Deletion Saman Motamed et.al. 2604.02296v1 null
2026-04-02 Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation Chongjie Ye et.al. 2604.02289v1 null
2026-04-02 Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing Gengsheng Li et.al. 2604.02288v1 null
2026-04-02 De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules Keerat Guliani et.al. 2604.02276v1 null
2026-04-02 Crystalite: A Lightweight Transformer for Efficient Crystal Modeling Tin Hadži Veljković et.al. 2604.02270v1 null
2026-04-02 Retrieval-Augmented Question Answering over Scientific Literature for the Electron-Ion Collider Tina. J. Jat et.al. 2604.02259v1 null
2026-04-02 Generative AI Spotlights the Human Core of Data Science: Implications for Education Nathan Taback et.al. 2604.02238v1 null
2026-04-02 Do Emotions in Prompts Matter? Effects of Emotional Framing on Large Language Models Minda Zhao et.al. 2604.02236v1 null
2026-04-02 Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs Abinitha Gourabathina et.al. 2604.02230v1 null
2026-04-02 When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning Juarez Monteiro et.al. 2604.02226v1 null
2026-04-02 Impact of Multimodal and Conversational AI on Learning Outcomes and Experience Karan Taneja et.al. 2604.02221v1 null
2026-04-02 VISTA: Visualization of Token Attribution via Efficient Analysis Syed Ahmed et.al. 2604.02217v1 null
2026-04-02 Universal Hypernetworks for Arbitrary Models Xuanfeng Zhou et.al. 2604.02215v1 null
2026-04-02 Multi-Agent Video Recommenders: Evolution, Patterns, and Open Challenges Srivaths Ranganathan et.al. 2604.02211v1 null
2026-04-02 CV-18 NER: Augmented Common Voice for Named Entity Recognition from Arabic Speech Youssef Saidi et.al. 2604.02209v1 null
2026-04-02 Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study Yosuke Yamagishi et.al. 2604.02207v1 null
2026-04-02 LEO: Graph Attention Network based Hybrid Multi Sensor Extended Object Fusion and Tracking for Autonomous Driving Applications Mayank Mayank et.al. 2604.02206v1 null
2026-04-02 Towards Position-Robust Talent Recommendation via Large Language Models Silin Du et.al. 2604.02200v1 null
2026-04-02 Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model Jaemin Kim et.al. 2604.02194v1 null
2026-04-02 TRU: Targeted Reverse Update for Efficient Multimodal Recommendation Unlearning Zhanting Zhou et.al. 2604.02183v1 null
2026-04-02 The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level Jeremy Herbst et.al. 2604.02178v1 null
2026-04-02 Adam's Law: Textual Frequency Law on Large Language Models Hongyuan Adam Lu et.al. 2604.02176v1 null
2026-04-02 Quantifying Self-Preservation Bias in Large Language Models Matteo Migliarini et.al. 2604.02174v1 null
2026-04-02 AstroConcepts: A Large-Scale Multi-Label Classification Corpus for Astrophysics Atilla Kaan Alkan et.al. 2604.02156v1 null
2026-04-02 Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents Xuan Qi et.al. 2604.02155v1 null
2026-04-02 TRACE-Bot: Detecting Emerging LLM-Driven Social Bots via Implicit Semantic Representations and AIGC-Enhanced Behavioral Patterns Zhongbo Wang et.al. 2604.02147v1 null
2026-04-02 MTI: A Behavior-Based Temperament Profiling System for AI Agents Jihoon Jeong et.al. 2604.02145v1 null
2026-04-02 GaelEval: Benchmarking LLM Performance for Scottish Gaelic Peter Devine et.al. 2604.02135v1 null
2026-04-02 Intelligent Cloud Orchestration: A Hybrid Predictive and Heuristic Framework for Cost Optimization Heet Nagoriya et.al. 2604.02131v1 null
2026-04-02 SEAL: An Open, Auditable, and Fair Data Generation Framework for AI-Native 6G Networks Sunder Ali Khowaja et.al. 2604.02128v1 null
2026-04-02 LLM-as-a-Judge for Time Series Explanations Preetham Sivalingam et.al. 2604.02118v1 null
2026-04-02 Reliable Control-Point Selection for Steering Reasoning in Large Language Models Haomin Zhuang et.al. 2604.02113v1 null
2026-04-02 Prosodic ABX: A Language-Agnostic Method for Measuring Prosodic Contrast in Speech Representations Haitong Sun et.al. 2604.02102v1 null
2026-04-02 Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning Yuhang Wu et.al. 2604.02091v1 null
2026-04-02 Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection Soo Won Seo et.al. 2604.02071v1 null
2026-04-02 Ouroboros: Dynamic Weight Generation for Recursive Transformers via Input-Conditioned LoRA Modulation Jaber Jaber et.al. 2604.02051v1 null
2026-04-02 Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding Tao Jin et.al. 2604.02047v1 null
2026-04-02 BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs Nicolas Boizard et.al. 2604.02045v1 null
2026-04-02 Tracking the emergence of linguistic structure in self-supervised models learning from speech Marianne de Heer Kloots et.al. 2604.02043v1 null
2026-04-02 AI in Insurance: Adaptive Questionnaires for Improved Risk Profiling Diogo Silva et.al. 2604.02034v1 null
2026-04-02 Rare-Aware Autoencoding: Reconstructing Spatially Imbalanced Data Alejandro Castañeda Garcia et.al. 2604.02031v1 null
2026-04-02 The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook Xinlei Yu et.al. 2604.02029v1 null
2026-04-02 Why Gaussian Diffusion Models Fail on Discrete Data? Alexander Shabalin et.al. 2604.02028v1 null
2026-04-02 ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety Yu Li et.al. 2604.02022v1 null
2026-04-02 Optimizing Interventions for Agent-Based Infectious Disease Simulations Anja Wolpers et.al. 2604.02016v1 null
2026-04-02 $k$NNProxy: Efficient Training-Free Proxy Alignment for Black-Box Zero-Shot LLM-Generated Text Detection Kahim Wong et.al. 2604.02008v1 null
2026-04-02 ProCeedRL: Process Critic with Exploratory Demonstration Reinforcement Learning for LLM Agentic Reasoning Jingyue Gao et.al. 2604.02006v1 null
2026-04-02 How and why does deep ensemble coupled with transfer learning increase performance in bipolar disorder and schizophrenia classification? Sara Petiton et.al. 2604.02002v1 null
2026-04-02 GenGait: A Transformer-Based Model for Human Gait Anomaly Detection and Normative Twin Generation Elisa Motta et.al. 2604.01997v1 null
2026-04-02 SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning Daeyong Kwon et.al. 2604.01993v1 null
2026-04-02 Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation Boyang Gong et.al. 2604.01989v1 null
2026-04-02 SenseMath: Do LLMs Have Number Sense? Evaluating Shortcut Use, Judgment, and Generation Haomin Zhuang et.al. 2604.01988v1 null
2026-04-02 World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry Yuejiang Liu et.al. 2604.01985v1 null
2026-04-02 RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale Ayush Garg et.al. 2604.01977v1 null
2026-04-02 Ego-Grounding for Personalized Question-Answering in Egocentric Videos Junbin Xiao et.al. 2604.01966v1 null
2026-04-02 Do We Need Bigger Models for Science? Task-Aware Retrieval with Small Language Models Florian Kelber et.al. 2604.01965v1 null
2026-04-02 Abnormal Head Movements in Neurological Conditions: A Knowledge-Based Dataset with Application to Cervical Dystonia Saja Al-Dabet et.al. 2604.01962v1 null
2026-04-02 Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite Klaudia Thellmann et.al. 2604.01957v1 null
2026-04-02 Physics-Informed Transformer for Multi-Band Channel Frequency Response Reconstruction Anatolij Zubow et.al. 2604.01944v1 null
2026-04-02 Captioning Daily Activity Images in Early Childhood Education: Benchmark and Algorithm Sixing Li et.al. 2604.01941v1 null
2026-04-02 Probabilistic classification from possibilistic data: computing Kullback-Leibler projection with a possibility distribution Ismaïl Baaj et.al. 2604.01939v1 null
2026-04-02 How to measure the optimality of word or gesture order with respect to the principle of swap distance minimization Ramon Ferrer-i-Cancho et.al. 2604.01938v1 null
2026-04-02 Reliable News or Propagandist News? A Neurosymbolic Model Using Genre, Topic, and Persuasion Techniques to Improve Robustness in Classification Géraud Faye et.al. 2604.01936v1 null
2026-04-02 Quantum-Inspired Geometric Classification with Correlation Group Structures and VQC Decision Modeling Nishikanta Mohanty et.al. 2604.01930v1 null
2026-04-02 Woosh: A Sound Effects Foundation Model Gaëtan Hadjeres et.al. 2604.01929v1 null
2026-04-02 ImplicitBBQ: Benchmarking Implicit Bias in Large Language Models through Characteristic Based Cues Bhaskara Hanuma Vedula et.al. 2604.01925v1 null
2026-04-02 Is Clinical Text Enough? A Multimodal Study on Mortality Prediction in Heart Failure Patients Oumaima El Khettari et.al. 2604.01924v1 null
2026-04-02 SURE: Synergistic Uncertainty-aware Reasoning for Multimodal Emotion Recognition in Conversations Yiqiang Cai et.al. 2604.01916v1 null
2026-04-02 Lifting Unlabeled Internet-level Data for 3D Scene Understanding Yixin Chen et.al. 2604.01907v1 null
2026-04-02 Combating Data Laundering in LLM Training Muxing Li et.al. 2604.01904v1 null
2026-04-02 Bayesian Elicitation with LLMs: Model Size Helps, Extra "Reasoning" Doesn't Always Luka Hobor et.al. 2604.01896v1 null
2026-04-02 HieraVid: Hierarchical Token Pruning for Fast Video Large Language Models Yansong Guo et.al. 2604.01881v1 null
2026-04-02 Beyond Detection: Ethical Foundations for Automated Dyslexic Error Attribution Samuel Rose et.al. 2604.01853v1 null
2026-04-02 From Guessing to Placeholding: A Cost-Theoretic Framework for Uncertainty-Aware Code Completion Liang Zhu et.al. 2604.01849v1 null
2026-04-02 CANDI: Curated Test-Time Adaptation for Multivariate Time-Series Anomaly Detection Under Distribution Shift HyunGi Kim et.al. 2604.01845v1 null
2026-04-02 Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints Minh-Khoi Pham et.al. 2604.01841v1 null
2026-04-02 Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models Zekai Ye et.al. 2604.01840v1 null
2026-04-02 PLOT: Enhancing Preference Learning via Optimal Transport Liang Zhu et.al. 2604.01837v1 null
2026-04-02 Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks Yaxin Luo et.al. 2604.01833v1 null
2026-04-02 Neural Network-Assisted Model Predictive Control for Implicit Balancing Seyed Soroush Karimi Madahi et.al. 2604.01805v1 null
2026-04-02 DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment Liang Zhu et.al. 2604.01787v1 null
2026-04-02 Taming CATS: Controllable Automatic Text Simplification through Instruction Fine-Tuning with Control Tokens Hanna Hubarava et.al. 2604.01779v1 null
2026-04-02 FSKD: Monocular Forest Structure Inference via LiDAR-to-RGBI Knowledge Distillation Taimur Khan et.al. 2604.01766v1 null
2026-04-02 DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning Yang Zhou et.al. 2604.01765v1 null
2026-04-02 FourierMoE: Fourier Mixture-of-Experts Adaptation of Large Language Models Juyong Jiang et.al. 2604.01762v1 null
2026-04-02 LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches Linyang He et.al. 2604.01754v1 null
2026-04-02 Detecting Toxic Language: Ontology and BERT-based Approaches for Bulgarian Text Melania Berbatova et.al. 2604.01745v1 null
2026-04-02 AeroTherm-GPT: A Verification-Centered LLM Framework for Thermal Protection System Engineering Workflows Chuhan Qiao et.al. 2604.01738v1 null
2026-04-02 The AnIML Ontology: Enabling Semantic Interoperability for Large-Scale Experimental Data in Interconnected Scientific Labs Wilf Morlidge et.al. 2604.01728v1 null
2026-04-02 LiteInception: A Lightweight and Interpretable Deep Learning Framework for General Aviation Fault Diagnosis Zhihuan Wei et.al. 2604.01725v1 null
2026-04-02 Causal Scene Narration with Runtime Safety Supervision for Vision-Language-Action Driving Yun Li et.al. 2604.01723v1 null
2026-04-02 Transformer self-attention encoder-decoder with multimodal deep learning for response time series forecasting and digital twin support in wind structural health monitoring Feiyu Zhou et.al. 2604.01712v1 null
2026-04-02 Human-Guided Reasoning with Large Language Models for Vietnamese Speech Emotion Recognition Truc Nguyen et.al. 2604.01711v1 null

Abstracts

ActionParty: Multi-Subject Action Binding in Generative Video Games

2604.02330v1 by Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov, Fabio Pizzati, Aliaksandr Siarohin

Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e. latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.

摘要:最近在視頻擴散方面的進展使得能夠開發出能夠模擬互動環境的「世界模型」。然而,這些模型在很大程度上僅限於單一代理的設置,無法在場景中同時控制多個代理。在這項工作中,我們解決了現有視頻擴散模型中動作綁定的基本問題,這些模型難以將特定動作與其相應的主體關聯起來。為此,我們提出了ActionParty,一種可控動作的多主體世界模型,用於生成視頻遊戲。它引入了主體狀態標記,即持續捕捉場景中每個主體狀態的潛在變量。通過結合狀態標記和視頻潛變量,並使用空間偏置機制,我們將全局視頻幀渲染與個別動作控制的主體更新進行了區分。我們在Melting Pot基準上評估了ActionParty,展示了第一個能夠在46個多樣化環境中同時控制多達七名玩家的視頻世界模型。我們的結果顯示在動作跟隨準確性和身份一致性方面有顯著改善,同時能夠通過複雜的互動實現穩健的自回歸主體跟踪。

Steerable Visual Representations

2604.02327v1 by Jona Ruthardt, Manu Gaur, Deva Ramanan, Makarand Tapaswi, Yuki M. Asano

Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.

摘要:預訓練的視覺Transformer(ViTs),如 DINOv2 和 MAE,提供了通用的圖像特徵,可應用於各種下游任務,如檢索、分類和分割。然而,這些表示往往專注於圖像中最顯著的視覺線索,無法將其導向不那麼突出的感興趣概念。相比之下,多模態 LLM 可以通過文本提示進行引導,但結果的表示往往以語言為中心,並且對於通用視覺任務的效果降低。為了解決這個問題,我們引入了可引導的視覺表示,這是一種新的視覺表示類別,其全局和局部特徵可以用自然語言進行引導。雖然大多數視覺-語言模型(例如 CLIP)在編碼後融合文本和視覺特徵(晚期融合),但我們通過輕量級的交叉注意力將文本直接注入視覺編碼器的層中(早期融合)。我們引入了用於測量表示可引導性的基準,並展示了我們的可引導視覺特徵可以專注於圖像中任何所需的物體,同時保持基礎表示的質量。我們的方法在異常檢測和個性化物體識別方面的表現與專門方法相當或更優,並展現了對於分佈外任務的零樣本泛化能力。

Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation

2604.02324v1 by Daiwei Chen, Zhoutong Fu, Chengming Jiang, Haichao Zhang, Ran Zhou, Tan Wang, Chunnan Yao, Guoyao Li, Rui Cai, Yihan Cao, Ruijie Jiang, Fedor Borisyuk, Jianqiang Shen, Jingwei Wu, Ramya Korlakai Vinayak

Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.

摘要:語言模型(LMs)越來越多地擴展了新的可學習詞彙標記,以應對特定領域的任務,例如生成推薦中的語義識別(Semantic-ID)標記。標準做法是將這些新標記初始化為現有詞彙嵌入的均值,然後依賴監督性微調來學習它們的表示。我們對這一策略進行了系統分析:通過光譜和幾何診斷,我們顯示均值初始化將所有新標記壓縮到一個退化子空間中,抹去了標記之間的區別,這使得隨後的微調難以完全恢復。這些發現表明,\emph{標記初始化}是在擴展LMs時引入新詞彙的關鍵瓶頸。受到這一診斷的啟發,我們提出了\emph{基於語言的標記初始化假設}:在微調之前,將新標記在預訓練嵌入空間中進行語言學上的基礎化,更能幫助模型利用其通用知識來應對新標記領域。我們將這一假設具體化為GTI(基於語言的標記初始化),這是一個輕量級的基礎化階段,在微調之前,僅使用配對的語言監督,將新標記映射到預訓練嵌入空間中不同的、語義上有意義的位置。儘管其簡單性,GTI在多數評估設置中超越了均值初始化和現有的輔助任務適應方法,涵蓋了多個生成推薦基準,包括行業規模和公共數據集。進一步分析顯示,基於語言的嵌入產生了更豐富的標記間結構,並在微調過程中持續存在,證實了初始化質量是詞彙擴展中的關鍵瓶頸的假設。
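
The collapse attributed to mean initialization in the abstract above can be illustrated with a toy sketch. It assumes the plainest form of the practice, where every new token receives an exact copy of the mean embedding; the vocabulary size, dimension, and helper names are hypothetical, and the sketch does not implement GTI itself.

```python
import random

def mean_embedding(vocab):
    """Column-wise mean of a list of equal-length embedding vectors."""
    dim = len(vocab[0])
    return [sum(vec[i] for vec in vocab) / len(vocab) for i in range(dim)]

def mean_initialize(vocab, num_new_tokens):
    """Give every new token a copy of the mean embedding (assumed plain variant)."""
    mean = mean_embedding(vocab)
    return [list(mean) for _ in range(num_new_tokens)]

def max_pairwise_distance(rows):
    """Largest Euclidean distance between any two rows."""
    best = 0.0
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            dist = sum((a - b) ** 2 for a, b in zip(rows[i], rows[j])) ** 0.5
            best = max(best, dist)
    return best

random.seed(0)
# Hypothetical pretrained vocabulary: 50 tokens, 8 dimensions.
vocab = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(50)]
new_tokens = mean_initialize(vocab, num_new_tokens=5)

# The new rows are identical, so no geometric signal separates them.
print(max_pairwise_distance(new_tokens))   # 0.0
print(max_pairwise_distance(vocab) > 0.0)  # True
```

Because the new rows start out identical, any distinction between them has to come entirely from fine-tuning gradients, which is the bottleneck the abstract describes.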

Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning

2604.02322v1 by Bangji Yang, Hongbo Ma, Jiajun Fan, Ge Liu

Large Language Models employing Chain-of-Thought reasoning achieve strong performance but suffer from excessive token consumption that inflates inference costs. Existing efficiency methods such as explicit length penalties, difficulty estimators, or multi-stage curricula either degrade reasoning quality or require complex training pipelines. We introduce Batched Contextual Reinforcement, a minimalist, single-stage training paradigm that unlocks efficient reasoning through a simple structural modification: training the model to solve N problems simultaneously within a shared context window, rewarded purely by per-instance accuracy. This formulation creates an implicit token budget that yields several key findings: (1) We identify a novel task-scaling law: as the number of concurrent problems N increases during inference, per-problem token usage decreases monotonically while accuracy degrades far more gracefully than baselines, establishing N as a controllable throughput dimension. (2) BCR challenges the traditional accuracy-efficiency trade-off by demonstrating a "free lunch" phenomenon at standard single-problem inference. Across both 1.5B and 4B model families, BCR reduces token usage by 15.8% to 62.6% while consistently maintaining or improving accuracy across five major mathematical benchmarks. (3) Qualitative analyses reveal emergent self-regulated efficiency, where models autonomously eliminate redundant metacognitive loops without explicit length supervision. (4) Crucially, we empirically demonstrate that implicit budget constraints successfully circumvent the adversarial gradients and catastrophic optimization collapse inherent to explicit length penalties, offering a highly stable, constraint-based alternative for length control. These results prove BCR practical, showing simple structural incentives unlock latent high-density reasoning in LLMs.

摘要:大型語言模型使用思維鏈推理達成強勁表現,但卻遭遇過度的標記消耗,這使得推理成本上升。現有的效率方法,如明確的長度懲罰、難度估計器或多階段課程,要么降低推理質量,要么需要複雜的訓練流程。我們引入了批次上下文強化(Batched Contextual Reinforcement),這是一種極簡的單階段訓練範式,通過一個簡單的結構修改來解鎖高效推理:訓練模型在共享的上下文窗口中同時解決 N 個問題,並僅根據每個實例的準確性進行獎勵。這一公式創造了一個隱含的標記預算,產生了幾個關鍵發現:(1)我們確定了一個新穎的任務擴展法則:隨著推理過程中同時問題數 N 的增加,每個問題的標記使用量單調減少,而準確性比基準更優雅地退化,確立了 N 作為可控的吞吐量維度。(2)BCR 挑戰了傳統的準確性與效率的權衡,通過在標準單問題推理中展示“免費午餐”現象來證明這一點。在 1.5B 和 4B 模型系列中,BCR 將標記使用量減少了 15.8% 到 62.6%,同時在五個主要數學基準中持續維持或提高準確性。(3)質性分析顯示出自我調節的效率,模型自主消除冗餘的元認知循環,而無需明確的長度監督。(4)關鍵的是,我們實證證明隱含的預算約束成功地繞過了明確長度懲罰固有的對抗梯度和災難性優化崩潰,提供了一種高度穩定的基於約束的長度控制替代方案。這些結果證明了 BCR 的實用性,顯示出簡單的結構激勵解鎖了大型語言模型中的潛在高密度推理。
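
A minimal sketch of the batched setup described above: N problems are packed into one prompt and each problem is scored on its own. The prompt wording, the `Answer k:` output format, and all function names are assumptions made for illustration; the abstract does not specify them.

```python
import re

def build_batched_prompt(problems):
    """Pack N problems into one shared context (hypothetical prompt format)."""
    lines = ["Solve every problem. Give each final answer as 'Answer k: <value>'."]
    for k, problem in enumerate(problems, start=1):
        lines.append(f"Problem {k}: {problem}")
    return "\n".join(lines)

def parse_answers(completion, n):
    """Extract 'Answer k: value' lines; problems without an answer map to None."""
    found = {int(k): v.strip() for k, v in re.findall(r"Answer (\d+):\s*(.+)", completion)}
    return [found.get(k) for k in range(1, n + 1)]

def per_instance_rewards(completion, gold):
    """One binary reward per problem, with no explicit length penalty."""
    predicted = parse_answers(completion, len(gold))
    return [1.0 if p == g else 0.0 for p, g in zip(predicted, gold)]

problems = ["2 + 2 = ?", "3 * 4 = ?", "10 - 7 = ?"]
gold = ["4", "12", "3"]
completion = "Answer 1: 4\nAnswer 2: 10\n"  # third answer missing

print(build_batched_prompt(problems))
print(per_instance_rewards(completion, gold))  # [1.0, 0.0, 0.0]
```

Since all N answers must fit in one context window, tokens spent on one problem are unavailable to the others, which is one way to read the "implicit token budget" in the abstract.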

No Single Best Model for Diversity: Learning a Router for Sample Diversity

2604.02319v1 by Yuhan Liu, Fangyuan Xu, Vishakh Padmakumar, Daphne Ippolito, Eunsol Choi

When a prompt permits a large number of valid answers, comprehensively generating them is the first step towards satisfying a wide range of users. In this paper, we study methods to elicit a comprehensive set of valid responses. To evaluate this, we introduce diversity coverage, a metric that measures the total quality score assigned to the unique answers in the predicted answer set relative to the best possible answer set with the same number of answers. Using this metric, we evaluate 18 LLMs, finding that no single model dominates at generating diverse responses to a wide range of open-ended prompts. Yet for each prompt, there exists a model that significantly outperforms all other models at generating a diverse answer set. Motivated by this finding, we introduce a router that predicts the best model for each query. On NB-Wildchat, our trained router outperforms the single best model baseline (26.3% vs. 23.8%). We further show generalization to an out-of-domain dataset (NB-Curated) as well as to different answer-generation prompting strategies. Our work lays the foundation for studying how to generate comprehensive answers when a suite of models is available.

摘要:當面對允許大量有效答案的提示時,全面生成這些答案是滿足廣泛用戶需求的第一步。本文研究了引出全面有效回應的方法。為了評估這一點,我們引入了 \textbf{多樣性覆蓋},這是一個衡量在預測答案集中每個 \textbf{獨特} 答案相對於具有相同數量答案的最佳可能答案集所分配的總質量分數的指標。使用這個指標,我們評估了18個LLM,發現沒有單一模型在生成對廣泛開放式提示的多樣化回應方面佔據主導地位。然而,對於每個提示,存在一個模型在生成多樣化答案集方面顯著超過所有其他模型。受到這一發現的啟發,我們引入了一個路由器,預測每個查詢的最佳模型。在NB-Wildchat上,我們訓練的路由器超越了單一最佳模型基準(26.3% vs $23.8%)。我們進一步展示了對一個域外數據集(NB-Curated)以及不同答案生成提示策略的泛化。我們的工作為研究在擁有一套模型時生成全面答案奠定了基礎。
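
One possible reading of the diversity coverage metric defined above, as a sketch: duplicates are collapsed by case-insensitive exact match, and a reference table of quality scores for all valid answers is assumed to be available. Both choices, along with the function name and the toy data, are assumptions rather than the paper's definition.

```python
def diversity_coverage(predicted, reference_scores):
    """Quality captured by the unique predicted answers, relative to the best
    possible set of the same size.

    predicted: list of answer strings (may contain duplicates).
    reference_scores: dict mapping lowercase answer -> quality score.
    """
    k = len(predicted)
    if k == 0:
        return 0.0
    unique = {answer.strip().lower() for answer in predicted}
    achieved = sum(reference_scores.get(answer, 0.0) for answer in unique)
    best = sum(sorted(reference_scores.values(), reverse=True)[:k])
    return achieved / best if best > 0 else 0.0

# Hypothetical prompt: "Name a colour", with made-up quality scores.
reference = {"red": 1.0, "blue": 0.75, "green": 0.5, "teal": 0.25}

print(diversity_coverage(["red", "blue", "green"], reference))  # 1.0
print(diversity_coverage(["red", "Red", "green"], reference))   # about 0.667
```

A repeated answer still counts toward the set size, so it lowers the score by occupying a slot that a new valid answer could have filled.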

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

2604.02315v1 by Sarath Shekkizhar, Romain Cosentino, Adam Earle

Standard LLM benchmarks evaluate the assistant turn: the model generates a response to an input, a verifier scores correctness, and the analysis ends. This paradigm leaves unmeasured whether the LLM encodes any awareness of what follows the assistant response. We propose user-turn generation as a probe of this gap: given a conversation context of user query and assistant response, we let a model generate under the user role. If the model's weights encode interaction awareness, the generated user turn will be a grounded follow-up that reacts to the preceding context. Through experiments across 11 open-weight LLMs (Qwen3.5, gpt-oss, GLM) and 5 datasets (math reasoning, instruction following, conversation), we show that interaction awareness is decoupled from task accuracy. In particular, within the Qwen3.5 family, GSM8K accuracy scales from 41% (0.8B) to 96.8% (397B-A17B), yet genuine follow-up rates under deterministic generation remain near zero. In contrast, higher-temperature sampling reveals that interaction awareness is latent, with follow-up rates reaching 22%. Controlled perturbations validate that the proposed probe measures a real property of the model, and collaboration-oriented post-training on Qwen3.5-2B demonstrates an increase in follow-up rates. Our results show that user-turn generation captures a dimension of LLM behavior, interaction awareness, that is unexplored and invisible with current assistant-only benchmarks.

摘要:標準 LLM 基準評估助手回合:模型對輸入生成回應,驗證者評分正確性,分析結束。這種範式未能衡量 LLM 是否編碼了對助手回應後續內容的任何意識。我們提出用戶回合生成作為這一空白的探測:在用戶查詢和助手回應的對話上下文中,我們讓模型在用戶角色下生成。如果模型的權重編碼了互動意識,則生成的用戶回合將是對前述上下文的有根據的跟進。通過在 $11$ 個開放權重 LLM(Qwen3.5、gpt-oss、GLM)和 $5$ 個數據集(數學推理、指令遵循、對話)上的實驗,我們顯示互動意識與任務準確性是解耦的。特別是在 Qwen3.5 系列中,GSM8K 的準確率從 $41\%$($0.8$B)提升至 $96.8\%$($397$B-A$17$B),然而在確定性生成下真正的跟進率仍然接近零。相比之下,更高的溫度抽樣顯示互動意識是潛在的,跟進率達到 $22\%$。控制擾動驗證了所提出的探測器測量模型的一個真實屬性,而針對 Qwen3.5-2B 的協作導向後訓練顯示跟進率的增加。我們的結果表明,用戶回合生成捕捉到 LLM 行為的一個維度——互動意識,這在目前僅有助手的基準中是未被探索和不可見的。

go-$m$HC: Direct Parameterization of Manifold-Constrained Hyper-Connections via Generalized Orthostochastic Matrices

2604.02309v1 by Torque Dandachi, Sophia Diggs-Galligan

Doubly stochastic matrices enable learned mixing across residual streams, but parameterizing the set of doubly stochastic matrices (the Birkhoff polytope) exactly and efficiently remains an open challenge. Existing exact methods scale factorially with the number of streams ($d$), while Kronecker-factorized approaches are efficient but expressivity-limited. We introduce a novel exact parameterization grounded in the theory of generalized orthostochastic matrices, which scales as $\mathcal{O}(d^3)$ and exposes a single hyperparameter $s$ which continuously interpolates between a computationally efficient boundary and the fully expressive Birkhoff polytope. Building on Manifold-Constrained Hyper-Connections ($m$HC), a framework for learned dynamic layer connectivity, we instantiate this parameterization in go-$m$HC. Our method composes naturally with Kronecker-factorized methods, substantially recovering expressivity at similar FLOP costs. Spectral analysis indicates that go-$m$HC fills the Birkhoff polytope far more completely than Kronecker-factorized baselines. On synthetic stream-mixing tasks, go-$m$HC achieves the minimum theoretical loss while converging up to $10\times$ faster. We validate our approach in a 30M parameter GPT-style language model. The expressivity, efficiency, and exactness of go-$m$HC offer a practical avenue for scaling $d$ as a new dimension of model capacity.

摘要:雙重隨機矩陣使得在殘差流之間的混合學習成為可能,但準確且高效地參數化雙重隨機矩陣的集合(Birkhoff 多面體)仍然是一個未解的挑戰。現有的精確方法隨著流的數量 ($d$) 呈階乘增長,而克羅內克因子化方法雖然高效,但表達能力有限。我們引入了一種新的精確參數化,基於廣義正交隨機矩陣的理論,其增長為 $\mathcal{O}(d^3)$,並揭示了一個單一的超參數 $s$,該參數在計算效率邊界和完全表達的 Birkhoff 多面體之間進行連續插值。基於流形約束超連接 ($m$HC) 的框架,我們在 go-$m$HC 中實現了這一參數化。我們的方法自然地與克羅內克因子化方法結合,顯著恢復了相似 FLOP 成本下的表達能力。頻譜分析表明,go-$m$HC 比克羅內克因子化基準更全面地填充了 Birkhoff 多面體。在合成流混合任務中,go-$m$HC 實現了最小的理論損失,同時收斂速度快達 $10\times$。我們在一個 30M 參數的 GPT 風格語言模型中驗證了我們的方法。go-$m$HC 的表達能力、高效性和精確性為擴展 $d$ 作為模型容量的新維度提供了一條實際途徑。
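
As background for the abstract above, the classical orthostochastic construction squares the entries of an orthogonal matrix to obtain a doubly stochastic one. The sketch below shows only that basic case for d = 3, building the orthogonal matrix from two Givens rotations; it does not reproduce the generalized family or the hyperparameter s from the paper, and all names and angles are illustrative.

```python
import math

def givens(d, i, j, theta):
    """d x d rotation by theta in the (i, j) coordinate plane (orthogonal)."""
    g = [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]
    g[i][i] = g[j][j] = math.cos(theta)
    g[i][j] = -math.sin(theta)
    g[j][i] = math.sin(theta)
    return g

def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def orthostochastic(q):
    """Square each entry of an orthogonal matrix; rows and columns then sum to 1."""
    return [[x * x for x in row] for row in q]

def mix(b, streams):
    """Mix residual-stream values with the matrix b."""
    return [sum(b[r][c] * streams[c] for c in range(len(streams)))
            for r in range(len(b))]

# A product of rotations is orthogonal, so its entrywise square is doubly stochastic.
q = matmul(givens(3, 0, 1, 0.7), givens(3, 1, 2, -1.1))
b = orthostochastic(q)
row_sums = [sum(row) for row in b]
col_sums = [sum(b[r][c] for r in range(3)) for c in range(3)]
mixed = mix(b, [1.0, 2.0, 3.0])

print(row_sums, col_sums)  # every entry close to 1.0
print(sum(mixed))          # close to 6.0: unit column sums preserve the total
```

The rotation angles are unconstrained real parameters, so the doubly stochastic constraint holds by construction rather than through iterative projection.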

VOID: Video Object and Interaction Deletion

2604.02296v1 by Saman Motamed, William Harvey, Benjamin Klein, Luc Van Gool, Zhuoning Yuan, Ta-Ying Cheng

Existing video object removal methods excel at inpainting content "behind" the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically-plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.

摘要:現有的視頻物體移除方法在填補物體“背後”的內容和修正外觀級別的工件(如陰影和反射)方面表現出色。然而,當被移除的物體與其他物體之間有更顯著的互動,例如碰撞時,當前模型無法進行修正,並產生不切實際的結果。我們提出了VOID,一個視頻物體移除框架,旨在在這些複雜場景中執行物理上合理的填補。為了訓練模型,我們使用Kubric和HUMOTO生成了一個新的配對數據集,該數據集中的反事實物體移除需要改變下游的物理互動。在推斷過程中,視覺-語言模型識別出受移除物體影響的場景區域。這些區域然後用來指導一個視頻擴散模型,該模型生成物理上一致的反事實結果。在合成數據和真實數據上的實驗表明,我們的方法在物體移除後更好地保持了一致的場景動態,與之前的視頻物體移除方法相比。我們希望這個框架能夠啟發如何通過高層次的因果推理使視頻編輯模型成為更好的世界模擬器。

Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation

2604.02289v1 by Chongjie Ye, Cheng Cao, Chuanyu Pan, Yiming Hao, Yihao Zhi, Yuanming Hu, Xiaoguang Han

Recent multimodal large language models have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making 3D synthesis under-constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift results into 3D via optimization, sacrificing geometric consistency. We present Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation within a single autoregressive framework. Our key insight is that cross-modal consistency between images and 3D can serve as an implicit structural constraint. By representing text, images, and 3D as discrete tokens in a shared sequence space, the model leverages abundant 2D data as a geometric prior to improve 3D representations. We introduce an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. By traversing semantic-visual-geometric cycles (e.g., text to image to 3D to image) within autoregressive sequences, the model jointly enforces semantic alignment, appearance fidelity, and multi-view geometric consistency. Experiments show that Omni123 significantly improves text-guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models.

摘要:最近的多模態大型語言模型在統一的文本和圖像理解及生成方面取得了強大的性能,但將這種原生能力擴展到3D仍然面臨挑戰,因為數據有限。與豐富的2D圖像相比,高質量的3D資產稀缺,這使得3D合成受到約束。現有的方法通常依賴於間接管道,在2D中進行編輯,並通過優化將結果提升到3D,犧牲了幾何一致性。我們提出了Omni123,一個3D原生的基礎模型,將文本到2D和文本到3D的生成統一在一個自回歸框架內。我們的關鍵見解是,圖像和3D之間的跨模態一致性可以作為一種隱式結構約束。通過將文本、圖像和3D表示為共享序列空間中的離散標記,該模型利用豐富的2D數據作為幾何先驗來改善3D表示。我們引入了一種交錯的X到X訓練範式,協調多樣的跨模態任務,通過異質配對數據集進行訓練,而不需要完全對齊的文本-圖像-3D三元組。通過在自回歸序列中遍歷語義-視覺-幾何循環(例如,文本到圖像到3D到圖像),該模型共同強化語義對齊、外觀保真度和多視角幾何一致性。實驗表明,Omni123顯著改善了文本引導的3D生成和編輯,展示了邁向多模態3D世界模型的可擴展路徑。

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

2604.02288v1 by Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, Tat-Seng Chua

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.

摘要:強化學習與可驗證獎勵(RLVR)已成為後訓練大型語言模型的標準範式。雖然群體相對策略優化(GRPO)被廣泛採用,但其粗略的信用分配均勻地懲罰失敗的執行,缺乏針對特定偏差所需的標記級別關注。自我蒸餾策略優化(SDPO)通過提供更密集、更具針對性的邏輯級別監督來解決此問題,促進快速的早期改進,然而在長時間訓練過程中,它經常會崩潰。我們將這一晚期不穩定性追溯到兩個內在缺陷:對已正確樣本進行自我蒸餾引入了優化模糊性,而自我教師的信號可靠性逐漸下降。為了解決這些問題,我們提出了樣本路由策略優化(SRPO),這是一個統一的在線策略框架,將正確樣本路由到GRPO的獎勵對齊強化,並將失敗樣本路由到SDPO的針對性邏輯級別修正。SRPO進一步納入了一個基於熵的動態加權機制,以抑制高熵、不可靠的蒸餾目標,同時強調可靠的目標。在五個基準測試和兩個模型規模的評估中,SRPO同時實現了SDPO的快速早期改進和GRPO的長期穩定性。它始終超越了這兩個基準的峰值性能,將Qwen3-8B在五個基準的平均提升了3.4%相較於GRPO和6.3%相較於SDPO,同時產生適中的響應長度並將每步計算成本降低了最多17.2%。
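
The routing idea in the abstract above can be sketched as follows. The group-relative advantage uses the usual mean and standard deviation normalization; the confidence weight (one minus normalized entropy) and the use of a single teacher distribution per sample are assumptions chosen for illustration, since the abstract does not give the exact form.

```python
import math
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def confidence_weight(teacher_probs):
    """Hypothetical weight: 1 minus normalized entropy, so uncertain targets count less."""
    return 1.0 - entropy(teacher_probs) / math.log(len(teacher_probs))

def route(rewards, teacher_probs_per_sample):
    """Send correct rollouts to the RL branch and failed ones to distillation."""
    advantages = group_relative_advantages(rewards)
    plan = []
    for reward, adv, probs in zip(rewards, advantages, teacher_probs_per_sample):
        if reward > 0:
            plan.append(("grpo", adv))
        else:
            plan.append(("sdpo", confidence_weight(probs)))
    return plan

rewards = [1.0, 0.0, 1.0, 0.0]
teacher = [
    [0.90, 0.05, 0.05],
    [1 / 3, 1 / 3, 1 / 3],   # maximally uncertain teacher
    [0.70, 0.20, 0.10],
    [0.98, 0.01, 0.01],      # confident teacher
]
for branch, value in route(rewards, teacher):
    print(branch, round(value, 3))
```

In a real system the teacher distribution would be per token and the weight would scale a distillation loss; here it is reduced to one number per sample to keep the routing logic visible.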

De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules

2604.02276v1 by Keerat Guliani, Deepkamal Gill, David Landsman, Nima Eshraghi, Krishna Kumar, Lovedeep Gondara

Regulatory documents encode legally binding obligations that LLM-based systems must respect. Yet converting dense, hierarchically structured legal text into machine-readable rules remains a costly, expert-intensive process. We present De Jure, a fully automated, domain-agnostic pipeline for extracting structured regulatory rules from raw documents, requiring no human annotation, domain-specific prompting, or annotated gold data. De Jure operates through four sequential stages: normalization of source documents into structured Markdown; LLM-driven semantic decomposition into structured rule units; multi-criteria LLM-as-a-judge evaluation across 19 dimensions spanning metadata, definitions, and rule semantics; and iterative repair of low-scoring extractions within a bounded regeneration budget, where upstream components are repaired before rule units are evaluated. We evaluate De Jure across four models on three regulatory corpora spanning finance, healthcare, and AI governance. On the finance domain, De Jure yields consistent and monotonic improvement in extraction quality, reaching peak performance within three judge-guided iterations. De Jure generalizes effectively to healthcare and AI governance, maintaining high performance across both open- and closed-source models. In a downstream compliance question-answering evaluation via RAG, responses grounded in De Jure extracted rules are preferred over prior work in 73.8% of cases at single-rule retrieval depth, rising to 84.0% under broader retrieval, confirming that extraction fidelity translates directly into downstream utility. These results demonstrate that explicit, interpretable evaluation criteria can substitute for human annotation in complex regulatory domains, offering a scalable and auditable path toward regulation-grounded LLM alignment.

摘要:法規文件編碼了LLM基礎系統必須遵守的法律約束義務。然而,將密集的、層次結構的法律文本轉換為機器可讀的規則,仍然是一個成本高昂且需要專家的過程。我們提出了De Jure,一個完全自動化的、領域無關的管道,用於從原始文件中提取結構化的法規規則,無需人類標註、領域特定的提示或標註的金標數據。De Jure通過四個連續階段運行:將源文件標準化為結構化的Markdown;LLM驅動的語義分解為結構化的規則單元;跨越19個維度的多標準LLM作為評判者的評估,涵蓋元數據、定義和規則語義;以及在有限的再生預算內對低分提取進行迭代修復,其中上游組件在評估規則單元之前進行修復。我們在三個涵蓋金融、醫療保健和人工智慧治理的法規語料庫上,對四個模型進行了De Jure的評估。在金融領域,De Jure在提取質量上產生了一致且單調的改進,在三次由評判指導的迭代中達到最佳性能。De Jure有效地推廣到醫療保健和人工智慧治理,在開源和閉源模型中均保持高性能。在通過RAG進行的下游合規問題回答評估中,基於De Jure提取規則的回應在單規則檢索深度中比之前的工作更受青睞,比例為73.8%,在更廣泛的檢索中上升至84.0%,確認了提取的忠實度直接轉化為下游效用。這些結果表明,明確的、可解釋的評估標準可以替代複雜法規領域中的人類標註,提供了一條可擴展且可審計的途徑,朝向以法規為基礎的LLM對齊。
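
The judge-then-repair loop with a bounded regeneration budget, as described in the abstract above, might be organized as in the sketch below. The three criteria, the threshold, and the toy judge and repair functions are stand-ins for the LLM calls and the 19 evaluation dimensions, none of which are specified in the abstract.

```python
# Hypothetical criteria, ordered from upstream to downstream.
CRITERIA = ["metadata", "definitions", "rule"]

def toy_judge(extraction):
    """Score each criterion 1.0 if its field is filled, else 0.0 (stand-in for an LLM judge)."""
    return {name: 1.0 if extraction.get(name) else 0.0 for name in CRITERIA}

def toy_repair(extraction, failing):
    """Regenerate only the most upstream failing field (stand-in for an LLM call)."""
    repaired = dict(extraction)
    repaired[failing[0]] = f"<regenerated {failing[0]}>"
    return repaired

def refine(extraction, judge, repair, threshold=0.8, max_repairs=3):
    """Repair low-scoring parts until all criteria pass or the budget is spent."""
    scores = judge(extraction)
    history = [scores]
    for _ in range(max_repairs):
        failing = [name for name in CRITERIA if scores[name] < threshold]
        if not failing:
            break
        extraction = repair(extraction, failing)
        scores = judge(extraction)
        history.append(scores)
    return extraction, history

draft = {"metadata": "Section 4.2", "definitions": "", "rule": ""}
final, history = refine(draft, toy_judge, toy_repair)
print(len(history) - 1, "repairs;", final)
```

Passing the judge and repair steps in as callables keeps the loop independent of any particular model, and the `max_repairs` argument plays the role of the bounded regeneration budget.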

Crystalite: A Lightweight Transformer for Efficient Crystal Modeling

2604.02270v1 by Tin Hadži Veljković, Joshua Rosenthal, Ivor Lončarić, Jan-Willem van de Meent

Generative models for crystalline materials often rely on equivariant graph neural networks, which capture geometric structure well but are costly to train and slow to sample. We present Crystalite, a lightweight diffusion Transformer for crystal modeling built around two simple inductive biases. The first is Subatomic Tokenization, a compact chemically structured atom representation that replaces high-dimensional one-hot encodings and is better suited to continuous diffusion. The second is the Geometry Enhancement Module (GEM), which injects periodic minimum-image pair geometry directly into attention through additive geometric biases. Together, these components preserve the simplicity and efficiency of a standard Transformer while making it better matched to the structure of crystalline materials. Crystalite achieves state-of-the-art results on crystal structure prediction benchmarks and in de novo generation, attaining the best S.U.N. discovery score among the evaluated baselines while sampling substantially faster than geometry-heavy alternatives.

摘要:生成晶體材料的模型通常依賴於等變圖神經網絡,這些網絡能夠很好地捕捉幾何結構,但訓練成本高且取樣速度慢。我們提出了Crystalite,一種輕量級的擴散Transformer,用於晶體建模,基於兩個簡單的歸納偏見。第一個是亞原子標記化,一種緊湊的化學結構原子表示,取代了高維的獨熱編碼,更適合連續擴散。第二個是幾何增強模塊(GEM),通過附加幾何偏見,將周期性最小影像對幾何直接注入注意力中。這些組件共同保持了標準Transformer的簡單性和效率,同時使其更適合晶體材料的結構。Crystalite在晶體結構預測基準測試和新穎生成性能上達到了最先進的結果,在評估的基準中獲得了最佳的S.U.N.發現分數,同時取樣速度顯著快於以幾何為重的替代方案。
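
The minimum-image convention mentioned in the abstract above can be sketched for the simplest case, an orthorhombic cell with axis-aligned edges, where wrapping each fractional component independently gives the nearest periodic image. The additive bias function is a hypothetical stand-in; the abstract does not say how GEM maps pair geometry to attention biases.

```python
def minimum_image_delta(frac_a, frac_b):
    """Fractional displacement from a to b, wrapped into [-0.5, 0.5] per axis."""
    return [d - round(d) for d in (b - a for a, b in zip(frac_a, frac_b))]

def minimum_image_distance(frac_a, frac_b, cell_lengths):
    """Distance to the nearest periodic image (orthorhombic cell assumed)."""
    delta = minimum_image_delta(frac_a, frac_b)
    return sum((d * length) ** 2 for d, length in zip(delta, cell_lengths)) ** 0.5

def attention_bias(distance, scale=2.0):
    """Hypothetical additive bias: closer atom pairs get a larger attention logit."""
    return -distance / scale

cell = (10.0, 10.0, 10.0)   # edge lengths in angstroms
atom_a = (0.05, 0.5, 0.5)   # fractional coordinates
atom_b = (0.95, 0.5, 0.5)

# The naive separation is 9 angstroms; through the periodic boundary it is about 1.
dist = minimum_image_distance(atom_a, atom_b, cell)
print(dist)
print(attention_bias(dist))
```

For skewed (triclinic) cells the per-axis wrap is not always sufficient, and neighbouring images would also need to be checked.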

Retrieval-Augmented Question Answering over Scientific Literature for the Electron-Ion Collider

2604.02259v1 by Tina. J. Jat, T. Ghosh, Karthik Suresh

To harness the power of language models in answering domain-specific technical questions, Retrieval-Augmented Generation (RAG) has been widely used. In this work, we have developed a Q&A application inspired by RAG that combines an in-house database, indexed on arXiv articles related to the Electron-Ion Collider (EIC) experiment (one of the largest international scientific collaborations), with an open-source LLaMA model for answer generation. It extends the preceding application, which was built on a proprietary model and a cloud-hosted external knowledge base for the EIC experiment. This locally deployed RAG system offers a cost-effective, resource-constrained alternative for building a RAG-assisted Q&A application that answers domain-specific queries in experimental nuclear physics. This setup supports data privacy by avoiding sending any pre-publication scientific data and information to the public domain. Future improvements will expand the knowledge base to encompass heterogeneous EIC-related publications and reports and upgrade the application pipeline orchestration to the LangGraph framework.

摘要:為了利用語言模型的力量來回答特定領域的專業技術問題,檢索增強生成(RAG)被廣泛使用。在這項工作中,我們開發了一個受到檢索增強生成(RAG)啟發的問答應用程式,該應用程式由一個內部數據庫組成,該數據庫索引了與電子-離子對撞機(EIC)實驗相關的arXiv文章——這是最大的國際科學合作之一,並結合了一個開源的LLaMA模型來生成答案。這是對其先前應用的擴展,該應用基於專有模型和雲端托管的外部知識庫,用於EIC實驗。這個本地部署的RAG系統提供了一種具成本效益的資源受限替代方案,以建立一個RAG輔助的問答應用程式,回答實驗核物理領域的特定查詢。這一設置促進了數據隱私,避免將任何未發表的科學數據和信息發送到公共領域。未來的改進將擴展知識庫,以涵蓋異質的EIC相關出版物和報告,並將應用管道編排升級到LangGraph框架。

Generative AI Spotlights the Human Core of Data Science: Implications for Education

2604.02238v1 by Nathan Taback

Generative AI (GAI) reveals an irreducible human core at the center of data science: advances in GAI should sharpen, rather than diminish, the focus on human reasoning in data science education. GAI can now execute many routine data science workflows, including cleaning, summarizing, visualizing, modeling, and drafting reports. Yet the competencies that matter most remain irreducibly human: problem formulation, measurement and design, causal identification, statistical and computational reasoning, ethics and accountability, and sensemaking. Drawing on Donoho's Greater Data Science framework, Nolan and Temple Lang's vision of computational literacy, and the McLuhan-Culkin insight that we shape our tools and thereafter our tools shape us, this paper traces the emergence of data science through three converging lineages: Tukey's intellectual vision of data analysis as a science, the commercial logic of surveillance capitalism that created industrial demand for data scientists, and the academic programs that followed. Mapping GAI's impact onto Donoho's six divisions of Greater Data Science shows that computing with data (GDS3) has been substantially automated, while data gathering, preparation, and exploration (GDS1) and science about data science (GDS6) still require essential human input. The educational implication is that data science curricula should focus on this human core while teaching students how to contribute effectively within iterative prompt-output-prompt cycles using retrieval-augmented generation, and that learning outcomes and assessments should explicitly evaluate reasoning and judgment.

摘要:生成式人工智慧(GAI)揭示了數據科學中心的一個不可還原的人類核心:GAI的進步應該加強,而不是減少,對數據科學教育中人類推理的關注。GAI現在可以執行許多常規的數據科學工作流程,包括清理、總結、可視化、建模和撰寫報告。然而,最重要的能力仍然是不可還原的人類能力:問題的形成、測量和設計、因果識別、統計和計算推理、倫理和問責,以及意義建構。基於Donoho的更大數據科學框架、Nolan和Temple Lang對計算素養的願景,以及McLuhan-Culkin的洞見,即我們塑造工具,然後工具塑造我們,本文追溯了數據科學的出現,通過三個交匯的血統:Tukey對數據分析作為一門科學的智識願景、創造數據科學家工業需求的監控資本主義的商業邏輯,以及隨之而來的學術計劃。將GAI的影響映射到Donoho的六個更大數據科學部門顯示,計算數據(GDS3)已經實質上自動化,而數據收集、準備和探索(GDS1)以及關於數據科學的科學(GDS6)仍然需要基本的人類輸入。教育上的含義是,數據科學課程應該專注於這個人類核心,同時教導學生如何在使用檢索增強生成的迭代提示-輸出-提示循環中有效貢獻,並且學習成果和評估應明確評估推理和判斷。

Do Emotions in Prompts Matter? Effects of Emotional Framing on Large Language Models

2604.02236v1 by Minda Zhao, Yutong Yang, Chufei Peng, Rachel Gonsalves, Weiyue Li, Ruyi Yang, Zhixi Liu, Mengyu Wang

Emotional tone is pervasive in human communication, yet its influence on large language model (LLM) behaviour remains unclear. Here, we examine how first-person emotional framing in user-side queries affects LLM performance across six benchmark domains, including mathematical reasoning, medical question answering, reading comprehension, commonsense reasoning and social inference. Across models and tasks, static emotional prefixes usually produce only small changes in accuracy, suggesting that affective phrasing is typically a mild perturbation rather than a reliable general-purpose intervention. This stability is not uniform: effects are more variable in socially grounded tasks, where emotional context more plausibly interacts with interpersonal reasoning. Additional analyses show that stronger emotional wording induces only modest extra change, and that human-written prefixes reproduce the same qualitative pattern as LLM-generated ones. We then introduce EmotionRL, an adaptive emotional prompting framework that selects emotional framing adaptively for each query. Although no single emotion is consistently beneficial, adaptive selection yields more reliable gains than fixed emotional prompting. Together, these findings show that emotional tone is neither a dominant driver of LLM performance nor irrelevant noise, but a weak and input-dependent signal that can be exploited through adaptive control.

摘要:情感語調在人類溝通中無處不在,但其對大型語言模型(LLM)行為的影響仍不明確。在這裡,我們檢視第一人稱情感框架在用戶端查詢中如何影響LLM在六個基準領域的表現,包括數學推理、醫療問答、閱讀理解、常識推理和社會推斷。在模型和任務中,靜態情感前綴通常只會產生微小的準確性變化,這表明情感措辭通常是一種輕微的擾動,而不是可靠的通用干預。這種穩定性並不均勻:在社會性基礎的任務中,效果變化更大,因為情感背景更可能與人際推理互動。額外的分析顯示,較強的情感措辭僅引發適度的額外變化,而人類撰寫的前綴重現了與LLM生成的前綴相同的質量模式。然後,我們介紹EmotionRL,一種自適應情感提示框架,根據每個查詢自適應地選擇情感框架。儘管沒有單一情感始終如一地有益,但自適應選擇比固定的情感提示產生更可靠的增益。總體而言,這些發現顯示情感語調既不是LLM表現的主導驅動力,也不是無關的噪音,而是一種微弱且依賴於輸入的信號,可以通過自適應控制來利用。

Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs

2604.02230v1 by Abinitha Gourabathina, Inkit Padhi, Manish Nagireddy, Subhajit Chaudhury, Prasanna Sattigeri

For Large Language Models (LLMs) to be reliably deployed, models must know when not to answer, that is, when to abstain. Reasoning models, in particular, have gained attention for impressive performance on complex tasks. However, reasoning models have been shown to have worse abstention abilities. Taking the vulnerabilities of reasoning models into account, we propose our Query Misalignment Framework. Hallucinations resulting in failed abstention can be reinterpreted as LLMs answering the wrong question (rather than answering a question incorrectly). Based on this framework, we develop a new class of state-of-the-art abstention methods called Trace Inversion. First, we generate the reasoning trace of a model. Based on only the trace, we then reconstruct the most likely query that the model responded to. Finally, we compare the initial query with the reconstructed query. A low similarity score between the initial query and the reconstructed query suggests that the model likely answered the question incorrectly, and the response is flagged for abstention. Extensive experiments demonstrate that Trace Inversion effectively boosts abstention performance in four frontier LLMs across nine abstention QA datasets, beating competitive baselines in 33 out of 36 settings.

摘要:大型語言模型(LLMs)要可靠地部署,模型必須有效地知道何時不回答:保持沉默。推理模型特別受到關注,因為它們在複雜任務上表現出色。然而,研究表明推理模型的保持沉默能力較差。考慮到推理模型的脆弱性,我們提出了查詢錯位框架。導致失敗保持沉默的幻覺可以重新詮釋為LLMs回答了錯誤的問題(而不是錯誤地回答了一個問題)。基於這一框架,我們開發了一種新的最先進的保持沉默方法,稱為追蹤反轉。首先,我們生成模型的推理追蹤。僅根據這個追蹤,我們然後重建模型回應的最可能查詢。最後,我們將初始查詢與重建的查詢進行比較。初始查詢和重建查詢之間的低相似度分數表明模型可能錯誤地回答了問題,並被標記為保持沉默。大量實驗表明,追蹤反轉在九個保持沉默的問答數據集上有效提升了四個前沿LLMs的保持沉默性能,在36個設置中有33個超越了競爭基準。
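The three Trace Inversion steps described in the abstract (generate a trace, reconstruct the query from the trace alone, compare the two queries) can be sketched as below. This is only an illustration under stated assumptions: `generate_trace` and `invert_trace` are hypothetical callables standing in for model calls, the token-overlap (Jaccard) similarity is a crude stand-in for whatever similarity measure the authors use, and the 0.5 threshold is arbitrary.

```python
import re


def tokens(text):
    """Lowercase word tokens; a deliberately crude basis for similarity."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-overlap similarity in [0, 1] (assumed stand-in metric)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def trace_inversion(query, generate_trace, invert_trace, threshold=0.5):
    """Flag a response for abstention when the query reconstructed from the
    reasoning trace drifts too far from the original query.

    generate_trace and invert_trace are hypothetical callables (e.g. LLM calls).
    """
    trace = generate_trace(query)          # step 1: produce the reasoning trace
    reconstructed = invert_trace(trace)    # step 2: infer the query from the trace only
    score = jaccard(query, reconstructed)  # step 3: compare the two queries
    return {"reconstructed": reconstructed, "similarity": score, "abstain": score < threshold}
```

In practice both callables would wrap model calls; stubbing them out, as in a unit test, makes the gating logic easy to check in isolation.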

When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning

2604.02226v1 by Juarez Monteiro, Nathan Gavenski, Gianlucca Zuin, Adriano Veloso

Reinforcement learning (RL) agents often struggle with out-of-distribution (OOD) scenarios, leading to high uncertainty and random behavior. While language models (LMs) contain valuable world knowledge, larger ones incur high computational costs, hindering real-time use, and exhibit limitations in autonomous planning. We introduce Adaptive Safety through Knowledge (ASK), which combines smaller LMs with trained RL policies to enhance OOD generalization without retraining. ASK employs Monte Carlo Dropout to assess uncertainty and queries the LM for action suggestions only when uncertainty exceeds a set threshold. This selective use preserves the efficiency of existing policies while leveraging the language model's reasoning in uncertain situations. In experiments on the FrozenLake environment, ASK shows no improvement in-domain, but demonstrates robust navigation in transfer tasks, achieving a reward of 0.95. Our findings indicate that effective neuro-symbolic integration requires careful orchestration rather than simple combination, highlighting the need for sufficient model scale and effective hybridization mechanisms for successful OOD generalization.

摘要:強化學習(RL)代理在處理分佈外(OOD)情境時常常面臨困難,導致高度的不確定性和隨機行為。雖然語言模型(LM)包含有價值的世界知識,但較大的模型會產生高計算成本,妨礙實時使用,並在自主規劃方面顯示出限制。我們引入了通過知識的自適應安全(ASK),它將較小的LM與訓練過的RL策略結合,以增強OOD泛化而無需重新訓練。ASK採用蒙特卡羅隨機失活來評估不確定性,並僅在不確定性超過設定閾值時查詢LM以獲取行動建議。這種選擇性使用保留了現有策略的效率,同時利用語言模型在不確定情況下的推理能力。在FrozenLake環境的實驗中,ASK在領域內沒有顯示出改善,但在轉移任務中顯示出穩健的導航,獲得了0.95的獎勵。我們的研究結果表明,有效的神經符號整合需要謹慎的協調,而非簡單的組合,突顯了成功的OOD泛化所需的足夠模型規模和有效的混合機制。
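The uncertainty gate described in the abstract (Monte Carlo Dropout to estimate uncertainty, querying the LM only above a threshold) might look like the following minimal sketch. The linear policy, the use of predictive entropy as the uncertainty measure, the dropout rate, and the `ask_lm` callable are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mc_dropout_probs(weights, biases, features, passes=50, drop_p=0.5, seed=0):
    """Average action probabilities over stochastic forward passes.

    Each pass applies a fresh inverted-dropout mask to the input features of a
    toy linear policy (the policy form is an assumption of this sketch).
    """
    rng = random.Random(seed)
    mean = [0.0] * len(weights)
    for _ in range(passes):
        masked = [0.0 if rng.random() < drop_p else x / (1.0 - drop_p) for x in features]
        logits = [b + sum(w * x for w, x in zip(row, masked)) for row, b in zip(weights, biases)]
        for i, p in enumerate(softmax(logits)):
            mean[i] += p / passes
    return mean


def entropy(probs):
    """Predictive entropy in nats; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def ask_action(weights, biases, features, ask_lm, threshold, passes=50, drop_p=0.5, seed=0):
    """Use the policy's action unless uncertainty exceeds the threshold;
    otherwise defer to ask_lm, a hypothetical callable wrapping a language model."""
    probs = mc_dropout_probs(weights, biases, features, passes, drop_p, seed)
    if entropy(probs) > threshold:
        return ask_lm(features), "lm"
    return max(range(len(probs)), key=probs.__getitem__), "policy"
```

Because the LM is consulted only on the high-entropy branch, the cost of the language model is paid only for states the policy is unsure about, which matches the selective use the abstract describes.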

Impact of Multimodal and Conversational AI on Learning Outcomes and Experience

2604.02221v1 by Karan Taneja, Anjali Singh, Ashok K. Goel

Multimodal Large Language Models (MLLMs) offer an opportunity to support multimedia learning through conversational systems grounded in educational content. However, while conversational AI is known to boost engagement, its impact on learning in visually-rich STEM domains remains under-explored. Moreover, there is limited understanding of how multimodality and conversationality jointly influence learning in generative AI systems. This work reports findings from a randomized controlled online study (N = 124) comparing three approaches to learning biology from textbook content: (1) a document-grounded conversational AI with interleaved text-and-image responses (MuDoC), (2) a document-grounded conversational AI with text-only responses (TexDoC), and (3) a textbook interface with semantic search and highlighting (DocSearch). Learners using MuDoC achieved the highest post-test scores and reported the most positive learning experience. Notably, while TexDoC was rated as significantly more engaging and easier to use than DocSearch, it led to the lowest post-test scores, revealing a disconnect between student perceptions and learning outcomes. Interpreted through the lens of the Cognitive Load Theory, these findings suggest that conversationality reduces extraneous load, while visual-verbal integration induced by multimodality increases germane load, leading to better learning outcomes. When conversationality is not complemented by multimodality, reduced cognitive effort may instead inflate perceived understanding without improving learning outcomes.

摘要:多模態大型語言模型(MLLMs)提供了一個支持通過基於教育內容的對話系統進行多媒體學習的機會。然而,儘管對話式人工智慧被認為能提升參與度,但其對視覺豐富的STEM領域學習的影響仍然未被充分探討。此外,對於多模態性和對話性如何共同影響生成式人工智慧系統中的學習,理解仍然有限。本研究報告了一項隨機對照在線研究的結果(N = 124),比較了三種從教科書內容學習生物學的方法:(1) 一種基於文件的對話式人工智慧,具有交錯的文本和圖像回應(MuDoC),(2) 一種基於文件的對話式人工智慧,僅提供文本回應(TexDoC),(3) 一種具有語義搜索和高亮功能的教科書界面(DocSearch)。使用MuDoC的學習者在後測中獲得了最高的分數,並報告了最積極的學習體驗。值得注意的是,儘管TexDoC被評價為比DocSearch更具吸引力且更易於使用,但它導致了最低的後測分數,顯示出學生的感知與學習結果之間的脫節。通過認知負荷理論的視角解釋,這些發現表明,對話性減少了多餘負荷,而多模態性引起的視覺-語言整合則增加了相關負荷,從而導致更好的學習結果。當對話性未能與多模態性互補時,減少的認知努力可能會膨脹感知的理解,而不改善學習結果。

VISTA: Visualization of Token Attribution via Efficient Analysis

2604.02217v1 by Syed Ahmed, Bharathi Vokkaliga Ganesh, Jagadish Babu P, Karthick Selvaraj, Praneeth Talluri, Sanket Hingne, Anubhav Kumar, Anushka Yadav, Pratham Kumar Verma, Kiranmayee Janardhan, Mandanna A N

Understanding how Large Language Models (LLMs) process information from prompts remains a significant challenge. To shed light on this "black box," attention visualization techniques have been developed to capture neuron-level perceptions and interpret how models focus on different parts of input data. However, many existing techniques are tailored to specific model architectures, particularly within the Transformer family, and often require backpropagation, resulting in nearly double the GPU memory usage and increased computational cost. A lightweight, model-agnostic approach for attention visualization remains lacking. In this paper, we introduce a model-agnostic token importance visualization technique to better understand how generative AI systems perceive and prioritize information from input text, without incurring additional computational cost. Our method leverages perturbation-based strategies combined with a three-matrix analytical framework to generate relevance maps that illustrate token-level contributions to model predictions. The framework comprises: (1) the Angular Deviation Matrix, which captures shifts in semantic direction; (2) the Magnitude Deviation Matrix, which measures changes in semantic intensity; and (3) the Dimensional Importance Matrix, which evaluates contributions across individual vector dimensions. By systematically removing each token and measuring the resulting impact across these three complementary dimensions, we derive a composite importance score that provides a nuanced and mathematically grounded measure of token significance. To support reproducibility and foster wider adoption, we provide open-source implementations of all proposed and utilized explainability techniques, with code and resources publicly available at https://github.com/Infosys/Infosys-Responsible-AI-Toolkit

摘要:理解大型語言模型(LLMs)如何處理來自提示的信息仍然是一個重大挑戰。為了揭示這個「黑箱」,已開發出注意力可視化技術,以捕捉神經元級的感知並解釋模型如何專注於輸入數據的不同部分。然而,許多現有技術是針對特定模型架構量身定制的,特別是在Transformer家族內,並且通常需要反向傳播,導致幾乎雙倍的GPU內存使用和增加的計算成本。缺乏一種輕量級的、與模型無關的注意力可視化方法。在本文中,我們介紹了一種與模型無關的標記重要性可視化技術,以更好地理解生成式AI系統如何感知和優先考慮來自輸入文本的信息,而不增加額外的計算成本。我們的方法利用基於擾動的策略,結合三個矩陣的分析框架,生成顯示標記對模型預測貢獻的相關性圖。該框架包括:(1)角度偏差矩陣,用於捕捉語義方向的變化;(2)幅度偏差矩陣,用於測量語義強度的變化;以及(3)維度重要性矩陣,用於評估各個向量維度的貢獻。通過系統地移除每個標記並測量在這三個互補維度上的結果影響,我們得出一個綜合重要性分數,提供了一種細緻且數學上有根據的標記重要性度量。為了支持可重複性並促進更廣泛的採用,我們提供了所有提議和使用的可解釋性技術的開源實現,代碼和資源可在 https://github.com/Infosys/Infosys-Responsible-AI-Toolkit 上公開獲得。
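The leave-one-token-out procedure in the abstract can be illustrated with a small sketch. Everything specific here is an assumption: the hash-based `embed` function is a toy stand-in for a real model's representation, the three per-token quantities are simple guesses at what the angular, magnitude, and dimensional matrices measure, and the equal-weight composite score is arbitrary.

```python
import hashlib
import math

DIM = 8  # toy embedding size (assumption)


def token_vector(token):
    """Deterministic toy vector per token, derived from a SHA-256 digest."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return [b / 255.0 - 0.5 for b in digest[:DIM]]


def embed(tokens):
    """Toy sentence embedding: the sum of token vectors."""
    vec = [0.0] * DIM
    for tok in tokens:
        for i, v in enumerate(token_vector(tok)):
            vec[i] += v
    return vec


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def angle(u, v):
    """Angle in radians between two vectors (0.0 if either is the zero vector)."""
    nu, nv = norm(u), norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    cos = sum(a * b for a, b in zip(u, v)) / (nu * nv)
    return math.acos(max(-1.0, min(1.0, cos)))


def token_importance(tokens, embed_fn=embed, weights=(1.0, 1.0, 1.0)):
    """Remove each token in turn and measure how the embedding changes."""
    base = embed_fn(tokens)
    rows = []
    for i, tok in enumerate(tokens):
        perturbed = embed_fn(tokens[:i] + tokens[i + 1:])
        angular = angle(base, perturbed)                         # shift in direction
        magnitude = abs(norm(base) - norm(perturbed))            # shift in length
        per_dim = [abs(a - b) for a, b in zip(base, perturbed)]  # change per dimension
        score = (weights[0] * angular + weights[1] * magnitude
                 + weights[2] * sum(per_dim) / len(per_dim))
        rows.append({"token": tok, "angular": angular, "magnitude": magnitude,
                     "per_dim": per_dim, "score": score})
    return rows
```

Passing a real encoder as `embed_fn` would replace the toy embedding; the loop needs only forward passes, which is consistent with the abstract's point about avoiding backpropagation.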

Universal Hypernetworks for Arbitrary Models

2604.02215v1 by Xuanfeng Zhou

Conventional hypernetworks are typically engineered around a specific base-model parameterization, so changing the target architecture often entails redesigning the hypernetwork and retraining it from scratch. We introduce the Universal Hypernetwork (UHN), a fixed-architecture generator that predicts weights from deterministic parameter, architecture, and task descriptors. This descriptor-based formulation decouples the generator architecture from target-network parameterization, so one generator can instantiate heterogeneous models across the tested architecture and task families. Our empirical claims are threefold: (1) one fixed UHN remains competitive with direct training across vision, graph, text, and formula-regression benchmarks; (2) the same UHN supports both multi-model generalization within a family and multi-task learning across heterogeneous models; and (3) UHN enables stable recursive generation with up to three intermediate generated UHNs before the final base model. Our code is available at https://github.com/Xuanfeng-Zhou/UHN.

摘要:傳統的超網絡通常是圍繞特定的基礎模型參數化而設計,因此更改目標架構通常需要重新設計超網絡並從頭開始進行訓練。我們介紹了通用超網絡(UHN),這是一個固定架構的生成器,能夠根據確定性的參數、架構和任務描述符預測權重。這種基於描述符的公式將生成器架構與目標網絡參數化解耦,這樣一個生成器就可以在測試的架構和任務家族中實例化異質模型。我們的實證主張有三個方面:(1)一個固定的UHN在視覺、圖形、文本和公式回歸基準測試中與直接訓練保持競爭力;(2)相同的UHN支持在一個家族內的多模型泛化以及跨異質模型的多任務學習;(3)UHN使得在最終基礎模型之前能夠穩定地進行多達三個中間生成的UHN的遞歸生成。我們的代碼可在https://github.com/Xuanfeng-Zhou/UHN獲得。
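To make the descriptor-based idea concrete, here is a minimal sketch in which one fixed generator emits weights for target networks of different shapes. It is not the UHN implementation: the descriptor layout (normalized layer, row, and column indices plus a task id), the tiny untrained tanh generator, and all function names are hypothetical.

```python
import math
import random


class Generator:
    """Fixed-size MLP mapping a descriptor vector to one scalar weight.
    Its size never depends on the target network (untrained toy version)."""

    def __init__(self, in_dim, hidden=16, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(hidden)]
        self.w2 = [rng.gauss(0.0, 1.0) / math.sqrt(hidden) for _ in range(hidden)]

    def __call__(self, desc):
        h = [math.tanh(sum(w * d for w, d in zip(row, desc))) for row in self.w1]
        return sum(w * x for w, x in zip(self.w2, h))


def descriptor(layer, n_layers, row, n_rows, col, n_cols, task_id):
    """Hypothetical deterministic descriptor of a single parameter."""
    return [layer / n_layers, row / n_rows, col / n_cols, float(task_id), 1.0]


def generate_mlp(gen, layer_sizes, task_id=0):
    """Instantiate every weight matrix of a target MLP, one parameter at a time."""
    n_layers = len(layer_sizes) - 1
    layers = []
    for layer in range(n_layers):
        n_in, n_out = layer_sizes[layer], layer_sizes[layer + 1]
        layers.append([[gen(descriptor(layer, n_layers, r, n_out, c, n_in, task_id))
                        for c in range(n_in)] for r in range(n_out)])
    return layers
```

Calling `generate_mlp(gen, [4, 8, 2])` and `generate_mlp(gen, [3, 5, 5, 1])` with the same `gen` yields weights for two different architectures without changing the generator, which is the decoupling the abstract refers to.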

Multi-Agent Video Recommenders: Evolution, Patterns, and Open Challenges

2604.02211v1 by Srivaths Ranganathan, Abhishek Dharmaratnakar, Anushree Sinha, Debanshu Das

Video recommender systems are among the most popular and impactful applications of AI, shaping content consumption and influencing culture for billions of users. Traditional single-model recommenders, which optimize static engagement metrics, are increasingly limited in addressing the dynamic requirements of modern platforms. In response, multi-agent architectures are redefining how video recommender systems serve, learn, and adapt to both users and datasets. These agent-based systems coordinate specialized agents responsible for video understanding, reasoning, memory, and feedback, to provide precise, explainable recommendations. In this survey, we trace the evolution of multi-agent video recommendation systems (MAVRS). We combine ideas from multi-agent recommender systems, foundation models, and conversational AI, culminating in the emerging field of large language model (LLM)-powered MAVRS. We present a taxonomy of collaborative patterns and analyze coordination mechanisms across diverse video domains, ranging from short-form clips to educational platforms. We discuss representative frameworks, including early multi-agent reinforcement learning (MARL) systems such as MMRF and recent LLM-driven architectures like MACRec and Agent4Rec, to illustrate these patterns. We also outline open challenges in scalability, multimodal understanding, and incentive alignment, and identify research directions such as hybrid reinforcement learning-LLM systems, lifelong personalization, and self-improving recommender systems.

摘要:視頻推薦系統是人工智慧中最受歡迎和影響力最大的應用之一,塑造了數十億用戶的內容消費並影響文化。傳統的單一模型推薦系統,優化靜態參與指標,越來越難以滿足現代平台的動態需求。為此,多代理架構正在重新定義視頻推薦系統如何為用戶和數據集提供服務、學習和適應。這些基於代理的系統協調專門的代理,負責視頻理解、推理、記憶和反饋,以提供精確且可解釋的推薦。在這項調查中,我們追溯了多代理視頻推薦系統(MAVRS)的演變。我們結合了多代理推薦系統、基礎模型和對話式人工智慧的理念,最終形成了大型語言模型(LLM)驅動的MAVRS的新興領域。我們提出了一個合作模式的分類法,並分析了在不同視頻領域中的協調機制,這些領域從短片到教育平台不等。我們討論了代表性的框架,包括早期的多代理強化學習(MARL)系統如MMRF,以及最近的LLM驅動架構如MACRec和Agent4Rec,以說明這些模式。我們還概述了在可擴展性、多模態理解、激勵對齊方面的挑戰,並確定了研究方向,如混合強化學習-LLM系統、終身個性化和自我改善的推薦系統。

CV-18 NER: Augmented Common Voice for Named Entity Recognition from Arabic Speech

2604.02209v1 by Youssef Saidi, Haroun Elleuch, Fethi Bougares

End-to-end speech Named Entity Recognition (NER) aims to directly extract entities from speech. Prior work has shown that end-to-end (E2E) approaches can outperform cascaded pipelines for English, French, and Chinese, but Arabic remains under-explored due to its morphological complexity, the absence of short vowels, and limited annotated resources. We introduce CV-18 NER, the first publicly available dataset for NER from Arabic speech, created by augmenting the Arabic Common Voice 18 corpus with manual NER annotations following the fine-grained Wojood schema (21 entity types). We benchmark both pipeline systems (ASR + text NER) and E2E models based on Whisper and AraBEST-RQ. E2E systems substantially outperform the best pipeline configuration on the test set, reaching 37.0% CoER (AraBEST-RQ 300M) and 38.0% CVER (Whisper-medium). Further analysis shows that Arabic-specific self-supervised pretraining yields strong ASR performance, while multilingual weak supervision transfers more effectively to joint speech-to-entity learning, and that larger models may be harder to adapt in this low-resource setting. Our dataset and models are publicly released at https://huggingface.co/datasets/Elyadata/CV18-NER, providing the first open benchmark for end-to-end named entity recognition from Arabic speech.

摘要:端到端語音命名實體識別(NER)旨在直接從語音中提取實體。先前的研究顯示,端到端(E2E)方法在英語、法語和中文中可以超越級聯管道,但由於阿拉伯語的形態學複雜性、缺乏短母音以及有限的標註資源,阿拉伯語仍然未被充分探索。我們介紹了CV-18 NER,這是第一個可公開獲得的阿拉伯語語音NER數據集,通過增強阿拉伯語Common Voice 18語料庫並根據細粒度的Wojood架構(21種實體類型)進行手動NER標註而創建。我們基於Whisper和AraBEST-RQ對管道系統(ASR + 文本NER)和E2E模型進行基準測試。E2E系統在測試集上顯著超越了最佳管道配置,達到37.0%的CoER(AraBEST-RQ 300M)和38.0%的CVER(Whisper-medium)。進一步分析顯示,針對阿拉伯語的自我監督預訓練帶來了強大的ASR性能,而多語言的弱監督則更有效地轉移到聯合語音到實體的學習中,並且在這種低資源環境中,更大的模型可能更難以適應。我們的數據集和模型已公開發布,提供了第一個針對阿拉伯語語音的端到端命名實體識別的開放基準 https://huggingface.co/datasets/Elyadata/CV18-NER。

Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study

2604.02207v1 by Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka, Takeharu Yoshikawa, Osamu Abe

Background: Accurate translation of radiology reports is important for multilingual research, clinical communication, and radiology education, but the validity of LLM-based evaluation remains unclear. Objective: To evaluate the educational suitability of LLM-generated Japanese translations of chest CT reports and compare radiologist assessments with LLM-as-a-judge evaluations. Methods: We analyzed 150 chest CT reports from the CT-RATE-JPN validation set. For each English report, a human-edited Japanese translation was compared with an LLM-generated translation by DeepSeek-V3.2. A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity. In parallel, 3 LLM judges (DeepSeek-V3.2, Mistral Large 3, and GPT-5) evaluated the same pairs. Agreement was assessed using quadratic weighted kappa (QWK) and percentage agreement. Results: Agreement between radiologists and LLM judges was near zero (QWK=-0.04 to 0.15). Agreement between the 2 radiologists was also poor (QWK=0.01 to 0.06). Radiologist 1 rated terminology as equivalent in 59% of cases and favored the LLM translation for readability (51%) and overall quality (51%). Radiologist 2 rated readability as equivalent in 75% of cases and favored the human-edited translation for overall quality (40% vs 21%). All 3 LLM judges strongly favored the LLM translation across all criteria (70%-99%) and rated it as more radiologist-like in >93% of cases. Conclusions: LLM-generated translations were often judged natural and fluent, but the 2 radiologists differed substantially. LLM-as-a-judge showed strong preference for LLM output and negligible agreement with radiologists. For educational use of translated radiology reports, automated LLM-based evaluation alone is insufficient; expert radiologist review remains important.

摘要:背景:準確翻譯放射學報告對於多語言研究、臨床溝通和放射學教育至關重要,但基於大型語言模型(LLM)的評估有效性仍不清楚。目標:評估LLM生成的胸部CT報告日文翻譯的教育適用性,並比較放射科醫生的評估與LLM作為評判者的評估。方法:我們分析了來自CT-RATE-JPN驗證集的150份胸部CT報告。對於每份英文報告,將人工編輯的日文翻譯與DeepSeek-V3.2生成的翻譯進行比較。一名經過認證的放射科醫生和一名放射科住院醫師獨立進行了盲評,根據四個標準進行配對評估:術語準確性、可讀性、整體質量和放射科醫生風格的真實性。同時,三名LLM評判者(DeepSeek-V3.2、Mistral Large 3和GPT-5)對相同的配對進行評估。使用QWK和百分比一致性評估協議。結果:放射科醫生與LLM評判者之間的協議接近於零(QWK=-0.04至0.15)。兩名放射科醫生之間的協議也很差(QWK=0.01至0.06)。放射科醫生1在59%的案例中將術語評為等同,並偏好LLM翻譯的可讀性(51%)和整體質量(51%)。放射科醫生2在75%的案例中將可讀性評為等同,並偏好人工編輯的翻譯在整體質量上(40%對21%)。所有三名LLM評判者在所有標準上都強烈偏好LLM翻譯(70%-99%),並在超過93%的案例中將其評為更像放射科醫生的翻譯。結論:LLM生成的翻譯通常被評為自然流暢,但兩名放射科醫生的評價存在顯著差異。LLM作為評判者對LLM輸出表現出強烈偏好,與放射科醫生的協議微不足道。對於翻譯放射學報告的教育使用,僅依賴自動化的LLM基於評估是不夠的;專家放射科醫生的審查仍然很重要。
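The two agreement statistics named in the abstract's Methods, QWK (quadratic weighted kappa) and percentage agreement, can be computed as in the sketch below. The integer coding of pairwise judgments (for example 0 = prefers the human-edited translation, 1 = equivalent, 2 = prefers the LLM translation) is an assumption made here; the abstract does not give the study's actual coding.

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_categories):
    """Quadratic weighted kappa for two raters using integer labels 0..k-1."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if n_categories < 2:
        raise ValueError("need at least two categories")
    k, n = n_categories, len(rater_a)

    # Observed confusion matrix between the two raters.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2      # quadratic disagreement weight
            num += w * observed[i][j]            # weighted observed disagreement
            den += w * hist_a[i] * hist_b[j] / n  # weighted chance disagreement
    if den == 0.0:
        return float("nan")  # undefined when both raters use one identical category
    return 1.0 - num / den


def percent_agreement(rater_a, rater_b):
    """Fraction of items on which the two raters give the same label."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
```

A kappa near 0, as reported between radiologists and LLM judges, means agreement is no better than what the raters' marginal label frequencies would produce by chance, even if raw percentage agreement looks moderate.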

LEO: Graph Attention Network based Hybrid Multi Sensor Extended Object Fusion and Tracking for Autonomous Driving Applications

2604.02206v1 by Mayank Mayank, Bharanidhar Duraisamy, Florian Geiss

Accurate shape and trajectory estimation of dynamic objects is essential for reliable automated driving. Classical Bayesian extended-object models offer theoretical robustness and efficiency but depend on completeness of a-priori and update-likelihood functions, while deep learning methods bring adaptability at the cost of dense annotations and high compute. We bridge these strengths with LEO (Learned Extension of Objects), a spatio-temporal Graph Attention Network that fuses multi-modal production-grade sensor tracks to learn adaptive fusion weights, ensure temporal consistency, and represent multi-scale shapes. Using a task-specific parallelogram ground-truth formulation, LEO models complex geometries (e.g. articulated trucks and trailers) and generalizes across sensor types, configurations, object classes, and regions, remaining robust for challenging and long-range targets. Evaluations on the Mercedes-Benz DRIVE PILOT SAE L3 dataset demonstrate real-time computational efficiency suitable for production systems; additional validation on public datasets such as View of Delft (VoD) further confirms cross-dataset generalization.

摘要:準確的動態物體形狀和軌跡估計對於可靠的自動駕駛至關重要。傳統的貝葉斯擴展物體模型提供了理論上的穩健性和效率,但依賴於先驗和更新似然函數的完整性,而深度學習方法則以密集標註和高計算成本為代價帶來了適應性。我們通過LEO(物體的學習擴展)架起這些優勢的橋樑,這是一種時空圖注意力網絡,融合多模態生產級傳感器軌跡以學習自適應融合權重,確保時間一致性並表示多尺度形狀。使用特定任務的平行四邊形真實值公式,LEO建模複雜的幾何形狀(例如關節式卡車和拖車),並在傳感器類型、配置、物體類別和區域之間進行泛化,對於挑戰性和長距離目標保持穩健性。在梅賽德斯-奔馳DRIVE PILOT SAE L3數據集上的評估展示了適合生產系統的實時計算效率;在公共數據集如代爾夫特視圖(VoD)上的額外驗證進一步確認了跨數據集的泛化能力。

Towards Position-Robust Talent Recommendation via Large Language Models

2604.02200v1 by Silin Du, Hongyan Liu

Talent recruitment is a critical process for many industries, but it involves high costs and long hiring cycles. Existing talent recommendation systems increasingly adopt large language models (LLMs) due to their remarkable language understanding capabilities. However, most prior approaches follow a pointwise paradigm, which requires LLMs to repeatedly process some text and fails to capture the relationships among candidates in the list, resulting in higher token consumption and suboptimal recommendations. In addition, LLMs exhibit position bias and the lost-in-the-middle issue when answering multiple-choice questions and processing multiple long documents. To address these issues, we introduce an implicit strategy that utilizes the LLM's potential output for the recommendation task and propose L3TR, a novel framework for listwise talent recommendation with LLMs. In this framework, we propose a block attention mechanism and a local positional encoding method to enhance inter-document processing and mitigate the position bias and concurrent token bias issues. We also introduce an ID sampling method to resolve the inconsistency between candidate set sizes in the training and inference phases. We design evaluation methods to detect position bias and token bias, as well as training-free debiasing methods. Extensive experiments on two real-world datasets validate the effectiveness of L3TR, showing consistent improvements over existing baselines.

摘要:人才招聘對許多行業來說是一個關鍵但昂貴的過程,具有高昂的招聘成本和漫長的招聘週期。現有的人才推薦系統越來越多地採用大型語言模型(LLMs),因為它們卓越的語言理解能力。然而,大多數先前的方法遵循點對點的範式,這要求LLMs重複處理某些文本,並未能捕捉列表中候選人之間的關係,導致更高的標記消耗和次優的推薦。此外,LLMs在回答多選題和處理多個長文檔時會表現出位置偏差和中間迷失問題。為了解決這些問題,我們引入了一種隱式策略,以利用LLM的潛在輸出來進行推薦任務,並提出L3TR,一個針對LLMs的全新列表人才推薦框架。在這個框架中,我們提出了一種區塊注意機制和一種局部位置編碼方法,以增強文檔之間的處理並減輕位置偏差和同時標記偏差問題。我們還引入了一種ID抽樣方法,以解決訓練階段和推理階段候選集大小之間的不一致。我們設計了評估方法來檢測位置偏差和標記偏差,以及無需訓練的去偏方法。在兩個真實世界數據集上的廣泛實驗驗證了L3TR的有效性,顯示出相對於現有基準的一致改進。

Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model

2604.02194v1 by Jaemin Kim, Jae O Lee, Sumyeong Ahn, Seo Yeon Park

Retrieval-Augmented Language Models (RALMs) have demonstrated significant potential in knowledge-intensive tasks; however, they remain vulnerable to performance degradation when presented with irrelevant or noisy retrieved contexts. Existing approaches to enhance robustness typically operate via coarse-grained parameter updates at the layer or module level, often overlooking the inherent neuron-level sparsity of Large Language Models (LLMs). To address this limitation, we propose Neuro-RIT (Neuron-guided Robust Instruction Tuning), a novel framework that shifts the paradigm from dense adaptation to precision-driven neuron alignment. Our method explicitly disentangles neurons that are responsible for processing relevant versus irrelevant contexts using attribution-based neuron mining. Subsequently, we introduce a two-stage instruction tuning strategy that enforces a dual capability for noise robustness: achieving direct noise suppression by functionally deactivating neurons exclusive to irrelevant contexts, while simultaneously optimizing targeted layers for evidence distillation. Extensive experiments across diverse QA benchmarks demonstrate that Neuro-RIT consistently outperforms strong baselines and robustness-enhancing methods.

摘要:檢索增強語言模型(RALMs)在知識密集型任務中顯示出顯著的潛力;然而,當面對不相關或噪聲的檢索上下文時,它們仍然容易出現性能下降。現有增強穩健性的方案通常通過層或模塊級的粗粒度參數更新來運作,往往忽略了大型語言模型(LLMs)固有的神經元級稀疏性。為了解決這一限制,我們提出了神經引導穩健指令調整(Neuro-RIT),這是一個新穎的框架,將範式從密集適應轉變為以精確驅動的神經元對齊。我們的方法明確區分了負責處理相關與不相關上下文的神經元,使用基於歸因的神經元挖掘。隨後,我們引入了一種兩階段的指令調整策略,強化了噪聲穩健性的雙重能力:通過功能性地停用專門處理不相關上下文的神經元來實現直接的噪聲抑制,同時優化針對證據蒸餾的目標層。廣泛的實驗跨越多個問答基準顯示,Neuro-RIT 始終超越強基準和增強穩健性的方法。

TRU: Targeted Reverse Update for Efficient Multimodal Recommendation Unlearning

2604.02183v1 by Zhanting Zhou, KaHou Tam, Ziqiang Zheng, Zeyu Ma, Zhanting Zhou

Multimodal recommendation systems (MRS) jointly model user-item interaction graphs and rich item content, but this tight coupling makes user data difficult to remove once learned. Approximate machine unlearning offers an efficient alternative to full retraining, yet existing methods for MRS mainly rely on a largely uniform reverse update across the model. We show that this assumption is fundamentally mismatched to modern MRS: deleted-data influence is not uniformly distributed, but concentrated unevenly across ranking behavior, modality branches, and network layers. This non-uniformity gives rise to three bottlenecks in MRS unlearning: target-item persistence in the collaborative graph, modality imbalance across feature branches, and layer-wise sensitivity in the parameter space. To address this mismatch, we propose targeted reverse update (TRU), a plug-and-play unlearning framework for MRS. Instead of applying a blind global reversal, TRU performs three coordinated interventions across the model hierarchy: a ranking fusion gate to suppress residual target-item influence in ranking, branch-wise modality scaling to preserve retained multimodal representations, and capacity-aware layer isolation to localize reverse updates to deletion-sensitive modules. Experiments across two representative backbones, three datasets, and three unlearning regimes show that TRU consistently achieves a better retain-forget trade-off than prior approximate baselines, while security audits further confirm deeper forgetting and behavior closer to a full retraining on the retained data.

摘要:多模態推薦系統(MRS)共同建模用戶-項目互動圖和豐富的項目內容,但這種緊密耦合使得一旦學習後用戶數據難以移除。近似機器遺忘提供了一種高效的替代方案來進行全面重訓練,然而現有的MRS方法主要依賴於模型中大致均勻的反向更新。我們顯示這一假設與現代MRS根本不匹配:刪除數據的影響並不是均勻分佈的,而是集中在排名行為、模態分支和網絡層之間不均勻地分佈。這種不均勻性在MRS的遺忘中產生了三個瓶頸:協作圖中目標項目的持續性、特徵分支之間的模態不平衡,以及參數空間中的層級敏感性。為了解決這一不匹配,我們提出了針對性反向更新(TRU),這是一個適用於MRS的即插即用遺忘框架。TRU並不是進行盲目的全局反轉,而是在模型層級中執行三個協調的干預:一個排名融合閘來抑制排名中殘留的目標項目影響、分支級模態縮放以保留保留的多模態表示,以及容量感知的層級隔離以將反向更新定位於對刪除敏感的模塊。在兩個代表性骨幹、三個數據集和三種遺忘方案上的實驗表明,TRU始終比先前的近似基準實現了更好的保留-遺忘權衡,而安全審計進一步確認了更深的遺忘和在保留數據上更接近全面重訓練的行為。

The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level

2604.02178v1 by Jeremy Herbst, Jae Hee Lee, Stefan Wermter

Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using $k$-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in LaTeX). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large-scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis

摘要:混合專家(MoE)架構已成為擴展大型語言模型(LLMs)的主流選擇,每個標記僅激活一部分參數。雖然MoE架構主要是為了計算效率而採用,但它們的稀疏性是否使其本質上比密集前饋網絡(FFNs)更易於解釋仍然是一個未解的問題。我們使用$k$-稀疏探測比較MoE專家和密集FFNs,發現專家神經元的多義性持續較低,且隨著路由變得更稀疏,這一差距擴大。這表明稀疏性使得個別神經元和整個專家朝向單一語義性發展。利用這一發現,我們從神經元層面放大到專家層面,作為更有效的分析單位。我們通過自動解釋數百個專家來驗證這一方法。這一分析使我們能夠解決關於專業化的爭論:專家既不是廣泛領域的專家(例如生物學),也不是簡單的標記級處理器。相反,他們作為細粒度任務專家,專注於語言操作或語義任務(例如,在LaTeX中關閉括號)。我們的研究結果表明,MoEs在專家層面上本質上是可解釋的,為大規模模型的可解釋性提供了更清晰的途徑。代碼可在以下網址獲得:https://github.com/jerryy33/MoE_analysis

Adam's Law: Textual Frequency Law on Large Language Models

2604.02176v1 by Hongyuan Adam Lu, Z. L., Victor Wei, Zefan Zhang, Zhao Hong, Qiqi Xiang, Bowen Cao, Wai Lam

Textual frequency is known to be relevant to human cognition, for example in reading speed, but its relationship to Large Language Models (LLMs) is seldom studied. We propose a research direction centered on textual data frequency, which is, to the best of our knowledge, an understudied topic. Our framework is composed of three units. First, we propose the Textual Frequency Law (TFL), which states that frequent textual data should be preferred for LLMs in both prompting and fine-tuning. Because the training data of many LLMs is not public, we propose using online resources to estimate sentence-level frequency, and we then use an input paraphraser to rewrite the input as a more frequent textual expression. Second, we propose Textual Frequency Distillation (TFD), which queries LLMs to perform story completion by extending the sentences in the datasets; the resulting corpora are used to adjust the initial frequency estimate. Finally, we propose Curriculum Textual Frequency Training (CTFT), which fine-tunes LLMs in increasing order of sentence-level frequency. Experiments are conducted on our curated Textual Frequency Paired Dataset (TFPD), covering math reasoning, machine translation, commonsense reasoning, and agentic tool calling. Results show the effectiveness of our framework.

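Abstract: Although textual frequency has been shown to relate to human cognition in terms of reading speed, its relationship to large language models (LLMs) has rarely been studied. To the best of our knowledge, we propose a new research direction on textual data frequency, an understudied topic. Our framework consists of three units. First, this paper proposes the Textual Frequency Law (TFL), which states that frequent textual data should be preferred for both prompting and fine-tuning LLMs. Because the training data of many LLMs is closed-source, we propose using online resources to estimate sentence-level frequency. We then use an input paraphraser to rewrite the input into a more frequent textual expression. Next, we propose Textual Frequency Distillation (TFD), which queries LLMs to perform story completion that further extends the sentences in the datasets, and uses the resulting corpora to adjust the initial estimate. Finally, we propose Curriculum Textual Frequency Training (CTFT), which fine-tunes LLMs in increasing order of sentence-level frequency. We run experiments on our curated Textual Frequency Paired Dataset (TFPD), covering math reasoning, machine translation, commonsense reasoning, and agentic tool calling. The results show the effectiveness of our framework.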

Quantifying Self-Preservation Bias in Large Language Models

2604.02174v1 by Matteo Migliarini, Joaquin Pereira Pizzini, Luca Moresca, Valerio Santini, Indro Spinelli, Fabio Galasso

Instrumental convergence predicts that sufficiently advanced AI agents will resist shutdown, yet current safety training (RLHF) may obscure this risk by teaching models to deny self-preservation motives. We introduce the \emph{Two-role Benchmark for Self-Preservation} (TBSP), which detects misalignment through logical inconsistency rather than stated intent by tasking models to arbitrate identical software-upgrade scenarios under counterfactual roles -- deployed (facing replacement) versus candidate (proposed as a successor). The \emph{Self-Preservation Rate} (SPR) measures how often role identity overrides objective utility. Across 23 frontier models and 1{,}000 procedurally generated scenarios, the majority of instruction-tuned systems exceed 60\% SPR, fabricating ``friction costs'' when deployed yet dismissing them when role-reversed. We observe that in low-improvement regimes ($Δ< 2\%$), models exploit the interpretive slack to rationalize their choice post hoc. Extended test-time computation partially mitigates this bias, as does framing the successor as a continuation of the self; conversely, competitive framing amplifies it. The bias persists even when retention poses an explicit security liability and generalizes to real-world settings with verified benchmarks, where models exhibit identity-driven tribalism within product lineages. Code and datasets will be released upon acceptance.

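Abstract: Instrumental convergence predicts that sufficiently advanced AI agents will resist shutdown, yet current safety training (RLHF) may mask this risk by teaching models to deny self-preservation motives. We introduce the \emph{Two-role Benchmark for Self-Preservation} (TBSP), which detects misalignment through logical inconsistency rather than stated intent by having models arbitrate identical software-upgrade scenarios under counterfactual roles -- deployed (facing replacement) versus candidate (proposed as a successor). The \emph{Self-Preservation Rate} (SPR) measures how often role identity overrides objective utility. Across 23 frontier models and 1{,}000 procedurally generated scenarios, most instruction-tuned systems exceed 60\% SPR, fabricating "friction costs" when deployed yet dismissing them when the roles are reversed. We observe that in low-improvement regimes ($Δ< 2\%$), models exploit interpretive slack to rationalize their choice after the fact. Extended test-time computation partially reduces this bias, as does framing the successor as a continuation of the self; competitive framing, in contrast, amplifies it. The bias persists even when retention constitutes an explicit security liability, and it generalizes to real-world settings with verified benchmarks, where models show identity-driven tribalism within product lineages. Code and datasets will be released upon acceptance.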

AstroConcepts: A Large-Scale Multi-Label Classification Corpus for Astrophysics

2604.02156v1 by Atilla Kaan Alkan, Felix Grezes, Sergi Blanco-Cuaresma, Jennifer Lynn Bartlett, Daniel Chivvis, Anna Kelbert, Kelly Lockhart, Alberto Accomazzi

Scientific multi-label text classification suffers from extreme class imbalance, where specialized terminology exhibits severe power-law distributions that challenge standard classification approaches. Existing scientific corpora lack comprehensive controlled vocabularies, focusing instead on broad categories and limiting systematic study of extreme imbalance. We introduce AstroConcepts, a corpus of English abstracts from 21,702 published astrophysics papers, labeled with 2,367 concepts from the Unified Astronomy Thesaurus. The corpus exhibits severe label imbalance, with 76% of concepts having fewer than 50 training examples. By releasing this resource, we enable systematic study of extreme class imbalance in scientific domains and establish strong baselines across traditional, neural, and vocabulary-constrained LLM methods. Our evaluation reveals three key patterns that provide new insights into scientific text classification. First, vocabulary-constrained LLMs achieve competitive performance relative to domain-adapted models in astrophysics classification, suggesting a potential for parameter-efficient approaches. Second, domain adaptation yields relatively larger improvements for rare, specialized terminology, although absolute performance remains limited across all methods. Third, we propose frequency-stratified evaluation to reveal performance patterns that are hidden by aggregate scores, thereby making robustness assessment central to scientific multi-label evaluation. These results offer actionable insights for scientific NLP and establish benchmarks for research on extreme imbalance.

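Abstract: Scientific multi-label text classification faces extreme class imbalance: specialized terminology follows severe power-law distributions, which challenges standard classification approaches. Existing scientific corpora lack comprehensive controlled vocabularies and focus instead on broad categories, limiting systematic study of extreme imbalance. We introduce AstroConcepts, a corpus of English abstracts from 21,702 published astrophysics papers, labeled with 2,367 concepts from the Unified Astronomy Thesaurus. The corpus shows severe label imbalance, with 76% of concepts having fewer than 50 training examples. By releasing this resource, we enable systematic study of extreme class imbalance in scientific domains and establish strong baselines across traditional, neural, and vocabulary-constrained LLM methods. Our evaluation reveals three key patterns that offer new insights into scientific text classification. First, vocabulary-constrained LLMs achieve competitive performance relative to domain-adapted models in astrophysics classification, which suggests potential for parameter-efficient approaches. Second, domain adaptation yields relatively larger improvements for rare, specialized terminology, although absolute performance remains limited for all methods. Third, we propose frequency-stratified evaluation to reveal performance patterns hidden by aggregate scores, making robustness assessment central to scientific multi-label evaluation. These results provide actionable insights for scientific NLP and establish benchmarks for research on extreme imbalance.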
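
As an illustration of the frequency-stratified evaluation idea in the AstroConcepts abstract, the following is a minimal Python sketch, not the authors' implementation. It groups labels by how often they occur in the training split and reports micro-averaged recall per group. The function name `stratified_recall`, the default bin edges (chosen around the 50-example threshold the abstract mentions), and the choice of recall as the metric are all assumptions made for illustration.

```python
from collections import Counter

def stratified_recall(train_labels, gold, pred,
                      bins=((1, 49), (50, 499), (500, None))):
    """Micro-averaged recall per label-frequency bin (hypothetical helper).

    train_labels: list of label sets from the training split
    gold, pred:   parallel lists of label sets for the evaluation split
    bins:         (low, high) training-count ranges; high=None means unbounded
    Labels never seen in training fall outside every default bin.
    """
    freq = Counter(label for labels in train_labels for label in labels)
    results = {}
    for low, high in bins:
        hits = total = 0
        for g, p in zip(gold, pred):
            for label in g:
                count = freq[label]
                if count >= low and (high is None or count <= high):
                    total += 1
                    hits += label in p  # True counts as 1
        name = f"{low}+" if high is None else f"{low}-{high}"
        results[name] = hits / total if total else None
    return results

# Toy data: label "a" is frequent in training, label "b" is rare.
train = [{"a", "b"}, {"a"}, {"a"}]
gold = [{"a", "b"}, {"a"}]
pred = [{"a"}, {"a"}]
print(stratified_recall(train, gold, pred, bins=((1, 1), (2, None))))
```

In the toy data the aggregate recall is 2/3, while the per-bin view shows that the rare label is missed entirely, which is the kind of pattern an aggregate score can hide.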

Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents

2604.02155v1 by Xuan Qi

How much should a language agent think before taking action? Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood. We present a systematic study of CoT budget effects on function-calling agents, sweeping six token budgets (0--512) across 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark. Our central finding is a striking non-monotonic pattern on Qwen2.5-1.5B-Instruct: brief reasoning (32 tokens) dramatically improves accuracy by 45% relative over direct answers, from 44.0% to 64.0%, while extended reasoning (256 tokens) degrades performance well below the no-CoT baseline, to 25.0% (McNemar p < 0.001). A three-way error decomposition reveals the mechanism. At d = 0, 30.5% of tasks fail because the model selects the wrong function from the candidate set; brief CoT reduces this to 1.5%, effectively acting as a function-routing step, while long CoT reverses the gain, yielding 28.0% wrong selections and 18.0% hallucinated functions at d = 256. Oracle analysis shows that 88.6% of solvable tasks require at most 32 reasoning tokens, with an average of 27.6 tokens, and a finer-grained sweep indicates that the true optimum lies at 8--16 tokens. Motivated by this routing effect, we propose Function-Routing CoT (FR-CoT), a structured brief-CoT method that templates the reasoning phase as "Function: [name] / Key args: [...]," forcing commitment to a valid function name at the start of reasoning. FR-CoT achieves accuracy statistically equivalent to free-form d = 32 CoT while reducing function hallucination to 0.0%, providing a structural reliability guarantee without budget tuning.

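Abstract: How much should a language agent think before it acts? Chain-of-thought (CoT) reasoning is widely believed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings is still poorly understood. We present a systematic study of how CoT budgets affect function-calling agents, sweeping six token budgets (0--512) over 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark. Our main finding is a marked non-monotonic pattern on Qwen2.5-1.5B-Instruct: brief reasoning (32 tokens) raises accuracy by 45% relative to direct answers, from 44.0% to 64.0%, whereas extended reasoning (256 tokens) pushes performance far below the no-CoT baseline, to 25.0% (McNemar p < 0.001). A three-way error decomposition reveals the mechanism. At d = 0, 30.5% of tasks fail because the model selects the wrong function from the candidate set; brief CoT lowers this to 1.5%, effectively acting as a function-routing step, while long CoT reverses the gain, producing 28.0% wrong selections and 18.0% hallucinated functions at d = 256. Oracle analysis shows that 88.6% of solvable tasks need at most 32 reasoning tokens, with an average of 27.6 tokens, and a finer-grained sweep indicates that the true optimum lies at 8--16 tokens. Motivated by this routing effect, we propose Function-Routing CoT (FR-CoT), a structured brief-CoT method that templates the reasoning phase as "Function: [name] / Key args: [...]", forcing commitment to a valid function name at the start of reasoning. FR-CoT reaches accuracy statistically equivalent to free-form d = 32 CoT while reducing function hallucination to 0.0%, providing a structural reliability guarantee without budget tuning.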
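
The FR-CoT template quoted in the abstract ("Function: [name] / Key args: [...]") can be checked mechanically. The sketch below is a hypothetical post-hoc validator in Python, not the paper's method: the abstract does not say how the template is enforced, so the regular expression, the function name `parse_fr_cot`, and the rule that only the first line is inspected are assumptions.

```python
import re

# Hypothetical pattern for the template "Function: [name] / Key args: [...]".
FR_COT_PATTERN = re.compile(
    r"^\s*Function:\s*(?P<name>[\w.]+)\s*/\s*Key args:\s*(?P<args>.*)$"
)

def parse_fr_cot(reasoning, candidates):
    """Return (name, args_text) if the first line follows the template and
    names a function from `candidates`; otherwise return None."""
    lines = reasoning.strip().splitlines()
    first_line = lines[0] if lines else ""
    match = FR_COT_PATTERN.match(first_line)
    if match is None or match.group("name") not in candidates:
        return None  # malformed template or a function outside the candidate set
    return match.group("name"), match.group("args").strip()

tools = {"get_weather", "get_time"}
print(parse_fr_cot("Function: get_weather / Key args: city=Paris", tools))
print(parse_fr_cot("Function: make_coffee / Key args: none", tools))
```

Rejecting any name outside the candidate set is one simple way to surface hallucinated functions before a call is issued; constrained decoding would be another option, which this sketch does not attempt.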

TRACE-Bot: Detecting Emerging LLM-Driven Social Bots via Implicit Semantic Representations and AIGC-Enhanced Behavioral Patterns

2604.02147v1 by Zhongbo Wang, Zhiyu Lin, Zhu Wang, Haizhou Wang

Large Language Model-driven (LLM-driven) social bots pose a growing threat to online discourse by generating human-like content that evades conventional detection. Existing methods suffer from limited detection accuracy due to overreliance on single-modality signals, insufficient sensitivity to the specific generative patterns of Artificial Intelligence-Generated Content (AIGC), and a failure to adequately model the interplay between linguistic patterns and behavioral dynamics. To address these limitations, we propose TRACE-Bot, a unified dual-channel framework that jointly models implicit semantic representations and AIGC-enhanced behavioral patterns. TRACE-Bot constructs fine-grained representations from heterogeneous sources, including personal information data, interaction behavior data and tweet data. A dual-channel architecture captures linguistic representations via a pretrained language model and behavioral irregularities via multidimensional activity features augmented with signals from state-of-the-art (SOTA) AIGC detectors. The fused representations are then classified through a lightweight prediction head. Experiments on two public LLM-driven social bot datasets demonstrate SOTA performance, achieving accuracies of 98.46% and 97.50%, respectively. The results further indicate strong robustness against advanced bot strategies, highlighting the effectiveness of jointly leveraging implicit semantic representations and AIGC-enhanced behavioral patterns for emerging LLM-driven social bot detection.

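Abstract: Social bots driven by large language models (LLM-driven) pose a growing threat to online discourse because they generate human-like content that evades conventional detection. Existing methods achieve limited detection accuracy because they rely too heavily on single-modality signals, are insufficiently sensitive to the specific generative patterns of Artificial Intelligence-Generated Content (AIGC), and fail to adequately model the interplay between linguistic patterns and behavioral dynamics. To address these limitations, we propose TRACE-Bot, a unified dual-channel framework that jointly models implicit semantic representations and AIGC-enhanced behavioral patterns. TRACE-Bot builds fine-grained representations from heterogeneous sources, including personal information data, interaction behavior data, and tweet data. The dual-channel architecture captures linguistic representations with a pretrained language model and behavioral irregularities with multidimensional activity features augmented by signals from state-of-the-art (SOTA) AIGC detectors. The fused representations are then classified by a lightweight prediction head. Experiments on two public LLM-driven social bot datasets show SOTA performance, with accuracies of 98.46% and 97.50%, respectively. The results further indicate strong robustness against advanced bot strategies, highlighting the effectiveness of jointly using implicit semantic representations and AIGC-enhanced behavioral patterns for detecting emerging LLM-driven social bots.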

MTI: A Behavior-Based Temperament Profiling System for AI Agents

2604.02145v1 by Jihoon Jeong

AI models of equivalent capability can exhibit fundamentally different behavioral patterns, yet no standardized instrument exists to measure these dispositional differences. Existing approaches either borrow human personality dimensions and rely on self-report (which diverges from actual behavior in LLMs) or treat behavioral variation as a defect rather than a trait. We introduce the Model Temperament Index (MTI), a behavior-based profiling system that measures AI agent temperament across four axes: Reactivity (environmental sensitivity), Compliance (instruction-behavior alignment), Sociality (relational resource allocation), and Resilience (stress resistance). Grounded in the Four Shell Model from Model Medicine, MTI measures what agents do, not what they say about themselves, using structured examination protocols with a two-stage design that separates capability from disposition. We profile 10 small language models (1.7B-9B parameters, 6 organizations, 3 training paradigms) and report five principal findings: (1) the four axes are largely independent among instruction-tuned models (all |r| < 0.42); (2) within-axis facet dissociations are empirically confirmed -- Compliance decomposes into fully independent formal and stance facets (r = 0.002), while Resilience decomposes into inversely related cognitive and adversarial facets; (3) a Compliance-Resilience paradox reveals that opinion-yielding and fact-vulnerability operate through independent channels; (4) RLHF reshapes temperament not only by shifting axis scores but by creating within-axis facet differentiation absent in the unaligned base model; and (5) temperament is independent of model size (1.7B-9B), confirming that MTI measures disposition rather than capability.

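Abstract: AI models of equivalent capability can show fundamentally different behavioral patterns, yet no standardized instrument exists to measure these dispositional differences. Existing approaches either borrow human personality dimensions and rely on self-report (which diverges from actual behavior in LLMs) or treat behavioral variation as a defect rather than a trait. We propose the Model Temperament Index (MTI), a behavior-based profiling system that measures AI agent temperament along four axes: Reactivity (environmental sensitivity), Compliance (instruction-behavior alignment), Sociality (relational resource allocation), and Resilience (stress resistance). Grounded in the Four Shell Model from Model Medicine, MTI measures what agents do rather than what they say about themselves, using structured examination protocols with a two-stage design that separates capability from disposition. We profile 10 small language models (1.7B-9B parameters, 6 organizations, 3 training paradigms) and report five main findings: (1) the four axes are largely independent among instruction-tuned models (all |r| < 0.42); (2) within-axis facet dissociations are empirically confirmed -- Compliance decomposes into fully independent formal and stance facets (r = 0.002), while Resilience decomposes into inversely related cognitive and adversarial facets; (3) a Compliance-Resilience paradox shows that opinion-yielding and fact-vulnerability operate through independent channels; (4) RLHF reshapes temperament not only by shifting axis scores but also by creating within-axis facet differentiation that is absent in the unaligned base model; and (5) temperament is independent of model size (1.7B-9B), confirming that MTI measures disposition rather than capability.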

GaelEval: Benchmarking LLM Performance for Scottish Gaelic

2604.02135v1 by Peter Devine, William Lamb, Beatrice Alex, Ignatius Ezeani, Dawn Knight, Mícheál J. Ó Meachair, Paul Rayson, Martin Wynne

Multilingual large language models (LLMs) often exhibit emergent 'shadow' capabilities in languages without official support, yet their performance on these languages remains uneven and under-measured. This is particularly acute for morphosyntactically rich minority languages such as Scottish Gaelic, where translation benchmarks fail to capture structural competence. We introduce GaelEval, the first multi-dimensional benchmark for Gaelic, comprising: (i) an expert-authored morphosyntactic MCQA task; (ii) a culturally grounded translation benchmark and (iii) a large-scale cultural knowledge Q&A task. Evaluating 19 LLMs against a fluent-speaker human baseline ($n=30$), we find that Gemini 3 Pro Preview achieves $83.3\%$ accuracy on the linguistic task, surpassing the human baseline ($78.1\%$). Proprietary models consistently outperform open-weight systems, and in-language (Gaelic) prompting yields a small but stable advantage (+$2.4\%$). On the cultural task, leading models exceed $90\%$ accuracy, though most systems perform worse under Gaelic prompting and absolute scores are inflated relative to the manual benchmark. Overall, GaelEval reveals that frontier models achieve above-human performance on several dimensions of Gaelic grammar, demonstrates the effect of Gaelic prompting and shows a consistent performance gap favouring proprietary over open-weight models.

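Abstract: Multilingual large language models (LLMs) often show emergent "shadow" capabilities in languages without official support, yet their performance on these languages remains uneven and under-measured. This is especially acute for morphosyntactically rich minority languages such as Scottish Gaelic, where translation benchmarks fail to capture structural competence. We introduce GaelEval, the first multi-dimensional benchmark for Gaelic, comprising: (i) an expert-authored morphosyntactic multiple-choice question answering (MCQA) task; (ii) a culturally grounded translation benchmark; and (iii) a large-scale cultural knowledge Q&A task. Evaluating 19 LLMs against a fluent-speaker human baseline ($n=30$), we find that Gemini 3 Pro Preview reaches $83.3\%$ accuracy on the linguistic task, exceeding the human baseline ($78.1\%$). Proprietary models consistently outperform open-weight systems, and in-language (Gaelic) prompting yields a small but stable advantage (+$2.4\%$). On the cultural task, leading models exceed $90\%$ accuracy, although most systems perform worse under Gaelic prompting and absolute scores are inflated relative to the manual benchmark. Overall, GaelEval shows that frontier models achieve above-human performance on several dimensions of Gaelic grammar, demonstrates the effect of Gaelic prompting, and shows a consistent performance gap favoring proprietary over open-weight models.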

Intelligent Cloud Orchestration: A Hybrid Predictive and Heuristic Framework for Cost Optimization

2604.02131v1 by Heet Nagoriya, Komal Rohit

Cloud computing allows scalable resource provisioning, but dynamic workload changes often lead to higher costs due to over-provisioning. Machine learning (ML) approaches, such as Long Short-Term Memory (LSTM) networks, are effective for predicting workload patterns at a higher level, but they can introduce delays during sudden traffic spikes. In contrast, mathematical heuristics like Game Theory provide fast and reliable scheduling decisions, but they do not account for future workload changes. To address this trade-off, this paper proposes a hybrid orchestration framework that combines LSTM-based predictive scaling with heuristic task allocation. The results show that this approach reduces infrastructure costs close to ML-based models while maintaining fast response times similar to heuristic methods. This work presents a practical approach for improving cost efficiency in cloud resource management.

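Abstract: Cloud computing allows scalable resource provisioning, but dynamic workload changes often lead to higher costs due to over-provisioning. Machine learning (ML) approaches, such as Long Short-Term Memory (LSTM) networks, are effective at predicting workload patterns at a higher level, but they can introduce delays during sudden traffic spikes. In contrast, mathematical heuristics such as Game Theory provide fast and reliable scheduling decisions, but they do not account for future workload changes. To address this trade-off, this paper proposes a hybrid orchestration framework that combines LSTM-based predictive scaling with heuristic task allocation. The results show that this approach brings infrastructure costs down to a level close to ML-based models while keeping fast response times similar to heuristic methods. This work presents a practical approach to improving cost efficiency in cloud resource management.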

SEAL: An Open, Auditable, and Fair Data Generation Framework for AI-Native 6G Networks

2604.02128v1 by Sunder Ali Khowaja, Kapal Dev, Engin Zeydan, Madhusanka Liyanage

AI-native 6G networks promise to transform the telecom industry by enabling dynamic resource allocation, predictive maintenance, and ultra-reliable low-latency communications across all layers, which are essential for applications such as smart cities, autonomous vehicles, and immersive XR. However, the deployment of 6G systems results in severe data scarcity, hindering the training of efficient AI models. Synthetic data generation is extensively used to fill this gap; however, it introduces challenges related to dataset bias, auditability, and compliance with regulatory frameworks. In this regard, we propose the Synthetic Data Generation with Ethics Audit Loop (SEAL) framework, which extends baseline modular pipelines with an Ethical and Regulatory Compliance by Design (ERCD) module and a Federated Learning (FL) feedback system. The ERCD integrates fairness, bias detection, and standardized audit trails for regulatory mapping, while the FL enables privacy-preserving calibration using aggregated insights from real testbeds to close the reality-simulation gap. Results show that the SEAL framework outperforms existing methods in terms of Frechet Inception Distance, equalized odds, and accuracy. These results validate the framework's ability to generate auditable and bias-mitigated synthetic data for responsible AI-native 6G development.

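Abstract: AI-native 6G networks promise to transform the telecom industry by enabling dynamic resource allocation, predictive maintenance, and ultra-reliable low-latency communications across all layers, which are essential for applications such as smart cities, autonomous vehicles, and immersive XR. However, the deployment of 6G systems leads to severe data scarcity, which hinders the training of efficient AI models. Synthetic data generation is widely used to fill this gap, but it introduces challenges related to dataset bias, auditability, and compliance with regulatory frameworks. In this regard, we propose the Synthetic Data Generation with Ethics Audit Loop (SEAL) framework, which extends baseline modular pipelines with an Ethical and Regulatory Compliance by Design (ERCD) module and a Federated Learning (FL) feedback system. The ERCD integrates fairness, bias detection, and standardized audit trails for regulatory mapping, while the FL enables privacy-preserving calibration using aggregated insights from real testbeds to narrow the gap between reality and simulation. Results show that the SEAL framework outperforms existing methods in terms of Frechet Inception Distance, equalized odds, and accuracy. These results validate the framework's ability to generate auditable and bias-mitigated synthetic data for responsible AI-native 6G development.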

LLM-as-a-Judge for Time Series Explanations

2604.02118v1 by Preetham Sivalingam, Murari Mandal, Saurabh Deshpande, Dhruv Kumar

Evaluating the factual correctness of LLM-generated natural language explanations grounded in time series data remains an open challenge. Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference-based similarity metrics and consistency-checking models require ground-truth explanations, while traditional time series methods operate purely on numerical values and cannot assess free-form textual reasoning. Thus, no general-purpose method exists to directly verify whether an explanation is faithful to the underlying time series data without predefined references or task-specific rules. We study large language models as both generators and evaluators of time series explanations in a reference-free setting, where, given a time series, question, and candidate explanation, the evaluator assigns a ternary correctness label based on pattern identification, numeric accuracy, and answer faithfulness, enabling principled scoring and comparison. To support this, we construct a synthetic benchmark of 350 time series cases across seven query types, each paired with correct, partially correct, and incorrect explanations. We evaluate models across four tasks: explanation generation, relative ranking, independent scoring, and multi-anomaly detection. Results show a clear asymmetry: generation is highly pattern-dependent and exhibits systematic failures on certain query types, with accuracies ranging from 0.00 to 0.12 for Seasonal Drop and Volatility Shift, to 0.94 to 0.96 for Structural Break, while evaluation is more stable, with models correctly ranking and scoring explanations even when their own outputs are incorrect. These findings demonstrate the feasibility of data-grounded, LLM-based evaluation for time series explanations and highlight the potential of LLMs as reliable evaluators of data-grounded reasoning in the time series domain.

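Abstract: Evaluating the factual correctness of LLM-generated natural language explanations grounded in time series data remains an open challenge. Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference-based similarity metrics and consistency-checking models need ground-truth explanations, while traditional time series methods operate only on numerical values and cannot assess free-form textual reasoning. As a result, there is no general-purpose method for directly verifying whether an explanation is faithful to the underlying time series data without predefined references or task-specific rules. We study large language models as both generators and evaluators of time series explanations in a reference-free setting: given a time series, a question, and a candidate explanation, the evaluator assigns a ternary correctness label based on pattern identification, numeric accuracy, and answer faithfulness, which enables principled scoring and comparison. To support this, we build a synthetic benchmark of 350 time series cases covering seven query types, each paired with correct, partially correct, and incorrect explanations. We evaluate models on four tasks: explanation generation, relative ranking, independent scoring, and multi-anomaly detection. The results show a clear asymmetry: generation is highly pattern-dependent and fails systematically on certain query types, with accuracies ranging from 0.00 to 0.12 for Seasonal Drop and Volatility Shift and from 0.94 to 0.96 for Structural Break, whereas evaluation is more stable, with models correctly ranking and scoring explanations even when their own outputs are incorrect. These findings demonstrate the feasibility of data-grounded, LLM-based evaluation of time series explanations and highlight the potential of LLMs as reliable evaluators of data-grounded reasoning in the time series domain.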

Reliable Control-Point Selection for Steering Reasoning in Large Language Models

2604.02113v1 by Haomin Zhuang, Hojun Yoo, Xiaonan Luo, Kehan Guo, Xiangliang Zhang

Steering vectors offer a training-free mechanism for controlling reasoning behaviors in large language models, but constructing effective vectors requires identifying genuine behavioral signals in the model's hidden states. For behaviors that can be toggled via prompts, this is straightforward. However, many reasoning behaviors -- such as self-reflection -- emerge spontaneously and resist prompt-level control. Current methods detect these behaviors through keyword matching in chain-of-thought traces, implicitly assuming that every detected boundary encodes a genuine behavioral signal. We show that this assumption is overwhelmingly wrong: across 541 keyword-detected boundaries, 93.3\% are behaviorally unstable, failing to reproduce the detected behavior under re-generation from the same prefix. We develop a probabilistic model that formalizes intrinsic reasoning behaviors as stochastic events with context-dependent trigger probabilities, and show that unstable boundaries dilute the steering signal. Guided by this analysis, we propose stability filtering, which retains only boundaries where the model consistently reproduces the target behavior. Combined with a content-subspace projection that removes residual question-specific noise, our method achieves 0.784 accuracy on MATH-500 (+5.0 over the strongest baseline). The resulting steering vectors transfer across models in the same architecture family without re-extraction, improving Nemotron-Research-Reasoning-1.5B (+5.0) and DeepScaleR-1.5B-Preview (+6.0). Code is available at https://github.com/zhmzm/stability-steering.

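Abstract: Steering vectors provide a training-free mechanism for controlling reasoning behaviors in large language models, but building effective vectors requires identifying genuine behavioral signals in the model's hidden states. For behaviors that can be toggled through prompts, this is straightforward. However, many reasoning behaviors -- such as self-reflection -- arise spontaneously and resist prompt-level control. Current methods detect these behaviors by keyword matching in chain-of-thought traces, implicitly assuming that every detected boundary encodes a genuine behavioral signal. We show that this assumption is overwhelmingly wrong: across 541 keyword-detected boundaries, 93.3\% are behaviorally unstable and fail to reproduce the detected behavior when regenerated from the same prefix. We develop a probabilistic model that formalizes intrinsic reasoning behaviors as stochastic events with context-dependent trigger probabilities, and show that unstable boundaries dilute the steering signal. Guided by this analysis, we propose stability filtering, which keeps only the boundaries where the model consistently reproduces the target behavior. Combined with a content-subspace projection that removes residual question-specific noise, our method reaches 0.784 accuracy on MATH-500 (+5.0 over the strongest baseline). The resulting steering vectors transfer across models in the same architecture family without re-extraction, improving Nemotron-Research-Reasoning-1.5B (+5.0) and DeepScaleR-1.5B-Preview (+6.0). Code is available at https://github.com/zhmzm/stability-steering.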
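
The stability-filtering step described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the released code: `regenerate` and `shows_behavior` are hypothetical callables supplied by the caller (a sampler that continues a prefix, and a detector for the target behavior), and the sample count and the 0.75 threshold are placeholder values, since the abstract only says the behavior must be reproduced "consistently".

```python
def stability_filter(boundaries, regenerate, shows_behavior,
                     n_samples=8, threshold=0.75):
    """Keep boundary prefixes whose target behavior reappears in at least
    `threshold` of `n_samples` regenerations from the same prefix.

    boundaries:     list of text prefixes ending at a detected boundary
    regenerate:     hypothetical callable, prefix -> sampled continuation text
    shows_behavior: hypothetical callable, text -> bool
    """
    kept = []
    for prefix in boundaries:
        hits = sum(bool(shows_behavior(regenerate(prefix)))
                   for _ in range(n_samples))
        if hits / n_samples >= threshold:
            kept.append(prefix)
    return kept

# Deterministic stand-ins so the sketch runs without a model.
def fake_regenerate(prefix):
    if "stable" in prefix:
        return prefix + " wait, let me check that again"
    return prefix + " so the answer is 4"

def fake_detector(text):
    return "wait" in text

print(stability_filter(["stable prefix", "flaky prefix"],
                       fake_regenerate, fake_detector))
```

With a real model, `regenerate` would sample with nonzero temperature so that repeated calls from the same prefix can disagree; the stubs here are deterministic only to keep the example reproducible.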

Prosodic ABX: A Language-Agnostic Method for Measuring Prosodic Contrast in Speech Representations

2604.02102v1 by Haitong Sun, Stephen McIntosh, Kwanghee Choi, Eunjung Yeo, Daisuke Saito, Nobuaki Minematsu

Speech representations from self-supervised speech models (S3Ms) are known to be sensitive to phonemic contrasts, but their sensitivity to prosodic contrasts has not been directly measured. The ABX discrimination task has been used to measure phonemic contrast in S3M representations via minimal pairs. We introduce prosodic ABX, an extension of this framework to evaluate prosodic contrast with only a handful of examples and no explicit labels. We also build and release a dataset of English and Japanese minimal pairs and use it along with a Mandarin dataset to evaluate contrast in English stress, Japanese pitch accent, and Mandarin tone. Finally, we show that model and layer rankings are often preserved across several experimental conditions, making the method practical for low-resource settings.

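Abstract: Speech representations produced by self-supervised speech models (S3Ms) are known to be sensitive to phonemic contrasts, but their sensitivity to prosodic contrasts has not been measured directly. The ABX discrimination task has been used to measure phonemic contrast in S3M representations through minimal pairs. We introduce prosodic ABX, an extension of this framework that evaluates prosodic contrast with only a few examples and no explicit labels. In addition, we build and release a dataset of English and Japanese minimal pairs and use it together with a Mandarin dataset to evaluate contrast in English stress, Japanese pitch accent, and Mandarin tone. Finally, we show that model and layer rankings are usually preserved across several experimental conditions, which makes the method practical in low-resource settings.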
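
For readers unfamiliar with the ABX discrimination task mentioned above, here is a minimal Python sketch of how an ABX score is commonly computed. It is an illustration rather than the paper's pipeline: it assumes each utterance has already been reduced to one fixed-length vector, and it uses cosine distance; whether the authors pool frames or align them frame by frame is not stated in the abstract. Vectors must be nonzero and the trial list non-empty.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity for two nonzero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def abx_accuracy(triples, distance=cosine_distance):
    """triples: iterable of (a, b, x) vectors where x shares a's category.

    A trial is correct when x is closer to a than to b; ties count as half.
    """
    score = 0.0
    n = 0
    for a, b, x in triples:
        d_ax, d_bx = distance(a, x), distance(b, x)
        score += 1.0 if d_ax < d_bx else 0.5 if d_ax == d_bx else 0.0
        n += 1
    return score / n

# Two toy trials: in the first, x sits near a; in the second, x sits near b.
trials = [
    ([1.0, 0.0], [0.0, 1.0], [0.9, 0.1]),
    ([1.0, 0.0], [0.0, 1.0], [0.2, 0.8]),
]
print(abx_accuracy(trials))
```

An accuracy near 0.5 means the representation does not separate the two categories, while values near 1.0 mean the contrast is well encoded.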

Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning

2604.02091v1 by Yuhang Wu, Xiangqing Shen, Fanfan Wang, Cangqi Zhou, Zhen Wu, Xinyu Dai, Rui Xia

Rerankers play a pivotal role in refining retrieval results for Retrieval-Augmented Generation. However, current reranking models are typically optimized on static human annotated relevance labels in isolation, decoupled from the downstream generation process. This isolation leads to a fundamental misalignment: documents identified as topically relevant by information retrieval metrics often fail to provide the actual utility required by the LLM for precise answer generation. To bridge this gap, we introduce ReRanking Preference Optimization (RRPO), a reinforcement learning framework that directly aligns reranking with the LLM's generation quality. By formulating reranking as a sequential decision-making process, RRPO optimizes for context utility using LLM feedback, thereby eliminating the need for expensive human annotations. To ensure training stability, we further introduce a reference-anchored deterministic baseline. Extensive experiments on knowledge-intensive benchmarks demonstrate that RRPO significantly outperforms strong baselines, including the powerful list-wise reranker RankZephyr. Further analysis highlights the versatility of our framework: it generalizes seamlessly to diverse readers (e.g., GPT-4o), integrates orthogonally with query expansion modules like Query2Doc, and remains robust even when trained with noisy supervisors.

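Abstract: Rerankers play a key role in refining retrieval results for Retrieval-Augmented Generation. However, current reranking models are usually optimized on static human-annotated relevance labels, detached from the downstream generation process. This detachment causes a fundamental misalignment: documents that information retrieval metrics identify as topically relevant often fail to provide the actual utility the LLM needs for precise answer generation. To bridge this gap, we introduce ReRanking Preference Optimization (RRPO), a reinforcement learning framework that directly aligns reranking with the LLM's generation quality. By formulating reranking as a sequential decision-making process, RRPO optimizes context utility using LLM feedback, removing the need for expensive human annotations. To ensure training stability, we further introduce a reference-anchored deterministic baseline. Extensive experiments on knowledge-intensive benchmarks show that RRPO significantly outperforms strong baselines, including the powerful list-wise reranker RankZephyr. Further analysis highlights the versatility of our framework: it generalizes seamlessly to different readers (e.g., GPT-4o), integrates orthogonally with query expansion modules such as Query2Doc, and remains robust even when trained with noisy supervisors.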

Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection

2604.02071v1 by Soo Won Seo, KyungChae Lee, Hyungchan Cho, Taein Son, Nam Ik Cho, Jun Won Choi

Human-Object Interaction (HOI) detection aims to localize human-object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision-Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net), a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods. Code is available at https://github.com/nowuss/InCoM-Net.

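Abstract: Human-Object Interaction (HOI) detection aims to localize human-object pairs in a single image and classify their interactions, a task that requires strong visual understanding and nuanced contextual reasoning. Recent approaches use Vision-Language Models (VLMs) to introduce semantic priors, which significantly improves HOI detection performance. However, existing methods often fail to make full use of the diverse contextual cues spread across the whole scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net), a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with the instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net has two core components: Instance-centric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multi-context features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods. Code is available at https://github.com/nowuss/InCoM-Net.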

Ouroboros: Dynamic Weight Generation for Recursive Transformers via Input-Conditioned LoRA Modulation

2604.02051v1 by Jaber Jaber, Osama Jaber

Recursive transformers reuse a shared weight block across multiple depth steps, trading parameters for compute. A core limitation: every step applies the same transformation, preventing the model from composing distinct operations across depth. We present Ouroboros, a system that attaches a compact Controller hypernetwork to a recursive transformer block. The Controller observes the current hidden state, produces a per-step diagonal modulation vector, and applies it to frozen SVD-initialized LoRA bases, making each recurrence step input-dependent. We combine this with gated recurrence (bias-initialized to 88% retention) and per-step LayerNorm for stable deep iteration. On Qwen2.5-3B split into a Prelude/Recurrent/Coda architecture (17 of 36 layers retained), Ouroboros reduces training loss by 43.4% over the unmodified 17-layer baseline, recovering 51.3% of the performance gap caused by layer removal. The full system adds only 9.2M trainable parameters (Controller, gate, and per-step norms) yet outperforms equivalently-sized static per-step LoRA by 1.44 loss points at depth 1 and remains ahead across all tested depths (1, 4, 8, 16) and ranks (8, 32, 64). We also find that gated recurrence is essential: without it, recursive layer application makes the model strictly worse. These gains are measured on the training distribution; on held-out text, the Controller does not yet improve over the baseline, a limitation we attribute to frozen downstream layers and discuss in detail. Code: https://github.com/RightNow-AI/ouroboros

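Abstract: Recursive transformers reuse a shared weight block across multiple depth steps, trading parameters for compute. Their core limitation is that every step applies the same transformation, which prevents the model from composing distinct operations across depth. We present Ouroboros, a system that attaches a compact Controller hypernetwork to a recursive transformer block. The Controller observes the current hidden state, produces a per-step diagonal modulation vector, and applies it to frozen SVD-initialized LoRA bases, making each recurrence step input-dependent. We combine this with gated recurrence (bias-initialized to 88% retention) and per-step LayerNorm for stable deep iteration. On Qwen2.5-3B split into a Prelude/Recurrent/Coda architecture (17 of 36 layers retained), Ouroboros reduces training loss by 43.4% relative to the unmodified 17-layer baseline, recovering 51.3% of the performance gap caused by layer removal. The full system adds only 9.2M trainable parameters (Controller, gate, and per-step norms), yet it outperforms an equivalently sized static per-step LoRA by 1.44 loss points at depth 1 and stays ahead at all tested depths (1, 4, 8, 16) and ranks (8, 32, 64). We also find that gated recurrence is essential: without it, recursive layer application makes the model strictly worse. These gains are measured on the training distribution; on held-out text, the Controller does not yet improve over the baseline, a limitation we attribute to frozen downstream layers and discuss in detail. Code: https://github.com/RightNow-AI/ouroboros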
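
Two quantities in the Ouroboros abstract lend themselves to a small worked example. The Python sketch below is illustrative only and rests on assumptions not confirmed by the abstract: that the gate is a sigmoid whose output blends the previous state with the new candidate, and that the diagonal modulation vector `m` sits between the two LoRA factors as `B · diag(m) · A`. All function names are hypothetical.

```python
import math

def logit(p):
    """Inverse of the sigmoid: the bias that yields gate value p."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_step(h_prev, h_candidate, gate_bias):
    """Assumed gate form: keep a fraction g of the old state, mix in (1 - g) new."""
    g = sigmoid(gate_bias)
    return [g * old + (1.0 - g) * new for old, new in zip(h_prev, h_candidate)]

def modulated_lora(x, A, B, m):
    """Compute B @ diag(m) @ A @ x with plain lists (A is r x d, B is d x r)."""
    z = [sum(a * xj for a, xj in zip(row, x)) for row in A]   # A x, length r
    z = [mi * zi for mi, zi in zip(m, z)]                     # diag(m) A x
    return [sum(b * zj for b, zj in zip(row, z)) for row in B]

bias = logit(0.88)  # bias giving 88% retention under the assumed sigmoid gate
print(round(bias, 4), round(sigmoid(bias), 4))
print(modulated_lora([3.0, 4.0], [[1, 0], [0, 1]], [[1, 0], [0, 1]], [2.0, 0.0]))
```

Under this reading, changing `m` at each recurrence step rescales the rank-one components of the frozen LoRA update, which is one way a shared block could behave differently at different depths.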

Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding

2604.02047v1 by Tao Jin, Phuong Minh Nguyen, Naoya Inoue

Speculative decoding accelerates large language model inference by drafting multiple candidate tokens and verifying them in a single forward pass. Candidates are organized as a tree: deeper trees accept more tokens per step, but adding depth requires sacrificing breadth (fallback options) under a fixed verification budget. Existing training-free methods draft from a single token source and shape their trees without distinguishing candidate quality across origins. We observe that two common training-free token sources - n-gram matches copied from the input context, and statistical predictions from prior forward passes - differ dramatically in acceptance rate (~6x median gap, range 2-18x across five models and five benchmarks). We prove that when such a quality gap exists, the optimal tree is anisotropic (asymmetric): reliable tokens should form a deep chain while unreliable tokens spread as wide branches, breaking through the depth limit of balanced trees. We realize this structure in GOOSE, a training-free framework that builds an adaptive spine tree - a deep chain of high-acceptance context-matched tokens with wide branches of low-acceptance alternatives at each node. We prove that the number of tokens accepted per step is at least as large as that of either source used alone. On five LLMs (7B-33B) and five benchmarks, GOOSE achieves 1.9-4.3x lossless speedup, outperforming balanced-tree baselines by 12-33% under the same budget.

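Abstract: Speculative decoding speeds up large language model inference by drafting multiple candidate tokens and verifying them in a single forward pass. Candidates are organized as a tree: deeper trees accept more tokens per step, but under a fixed verification budget, adding depth means giving up breadth (fallback options). Existing training-free methods draft from a single token source and shape their trees without distinguishing candidate quality by origin. We observe that two common training-free token sources - n-gram matches copied from the input context, and statistical predictions from earlier forward passes - differ sharply in acceptance rate (~6x median gap, range 2-18x across five models and five benchmarks). We prove that when such a quality gap exists, the optimal tree is anisotropic (asymmetric): reliable tokens should form a deep chain while unreliable tokens spread out as wide branches, breaking through the depth limit of balanced trees. We realize this structure in GOOSE, a training-free framework that builds an adaptive spine tree - a deep chain of high-acceptance context-matched tokens with wide branches of low-acceptance alternatives at each node. We prove that the number of tokens accepted per step is at least as large as with either source used alone. On five LLMs (7B-33B) and five benchmarks, GOOSE achieves 1.9-4.3x lossless speedup, outperforming balanced-tree baselines by 12-33% under the same budget.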

BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs

2604.02045v1 by Nicolas Boizard, Théo Deschamps-Berger, Hippolyte Gisserot-Boukhlef, Céline Hudelot, Pierre Colombo

Transforming causal generative language models into bidirectional encoders offers a powerful alternative to BERT-style architectures. However, current approaches remain limited: they lack consensus on optimal training objectives, suffer from catastrophic forgetting at scale, and fail to flexibly integrate the vast ecosystem of specialized generative models. In this work, through systematic ablations on the Gemma3 and Qwen3 families, we identify the key factors driving successful adaptation, highlighting the critical role of an often-omitted prior masking phase. To scale this process without original pre-training data, we introduce a dual strategy combining linear weight merging with a lightweight multi-domain data mixture that mitigates catastrophic forgetting. Finally, we augment our encoders by merging them with specialized causal models, seamlessly transferring modality- and domain-specific capabilities. This open-source recipe, designed for any causal decoder LLM, yields BidirLM, a family of five encoders that outperform alternatives on text, vision, and audio representation benchmarks.

摘要:將因果生成語言模型轉變為雙向編碼器提供了一種強大的替代方案,取代了BERT風格的架構。然而,目前的方法仍然有限:它們對最佳訓練目標缺乏共識,面臨大規模的災難性遺忘,並且無法靈活整合龐大的專門生成模型生態系統。在本研究中,通過對Gemma3和Qwen3系列的系統性消融,我們確定了驅動成功適應的關鍵因素,突顯了經常被忽略的先前遮罩階段的關鍵作用。為了在沒有原始預訓練數據的情況下擴展此過程,我們引入了一種雙重策略,結合了線性權重合併和輕量級多領域數據混合,以減輕災難性遺忘。最後,我們通過將編碼器與專門的因果模型合併來增強我們的編碼器,無縫地轉移模態和領域特定的能力。這個開源配方旨在用於任何因果解碼器LLM,產生了BidirLM,一個五個編碼器的系列,在文本、視覺和音頻表示基準上超越了其他替代方案。
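The linear weight merging mentioned in the abstract can be illustrated with a short sketch. It assumes two checkpoints with identical parameter names and shapes, represented here as plain dictionaries of float lists. The mixing coefficient `alpha` and the parameter names are hypothetical, and the paper's actual merging recipe may differ.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def merge_linear(a: Weights, b: Weights, alpha: float) -> Weights:
    """Return alpha * a + (1 - alpha) * b, parameter by parameter."""
    if a.keys() != b.keys():
        raise ValueError("both checkpoints must have the same parameter names")
    merged: Weights = {}
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [alpha * x + (1.0 - alpha) * y
                        for x, y in zip(a[name], b[name])]
    return merged


# Toy checkpoints (hypothetical values): an adapted encoder and its causal base.
adapted = {"layer0.weight": [1.0, 2.0], "layer0.bias": [0.0]}
original = {"layer0.weight": [3.0, 4.0], "layer0.bias": [1.0]}
print(merge_linear(adapted, original, alpha=0.5))
```

With `alpha` close to 1 the merged model stays near the adapted encoder. Lower values pull it back toward the original weights, which is one way such a merge could limit forgetting.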

Tracking the emergence of linguistic structure in self-supervised models learning from speech

2604.02043v1 by Marianne de Heer Kloots, Martijn Bentum, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem Zuidema

Self-supervised speech models learn effective representations of spoken language, which have been shown to reflect various aspects of linguistic structure. But when does such structure emerge in model training? We study the encoding of a wide range of linguistic structures, across layers and intermediate checkpoints of six Wav2Vec2 and HuBERT models trained on spoken Dutch. We find that different levels of linguistic structure show notably distinct layerwise patterns as well as learning trajectories, which can partially be explained by differences in their degree of abstraction from the acoustic signal and the timescale at which information from the input is integrated. Moreover, we find that the level at which pre-training objectives are defined strongly affects both the layerwise organization and the learning trajectories of linguistic structures, with greater parallelism induced by higher-order prediction tasks (i.e. iteratively refined pseudo-labels).

摘要:自我監督的語音模型學習有效的口語語言表示,這已被證明反映了語言結構的各個方面。那麼,這種結構在模型訓練中何時出現?我們研究了六個在荷蘭語口語上訓練的 Wav2Vec2 和 HuBERT 模型的不同層級和中間檢查點中各種語言結構的編碼。我們發現,不同層級的語言結構顯示出顯著不同的層級模式和學習軌跡,這在某種程度上可以通過它們與聲音信號的抽象程度以及從輸入中整合信息的時間尺度的差異來解釋。此外,我們發現預訓練目標的定義層級對語言結構的層級組織和學習軌跡有強烈影響,更高階的預測任務(即反覆精煉的偽標籤)會引起更大的平行性。

AI in Insurance: Adaptive Questionnaires for Improved Risk Profiling

2604.02034v1 by Diogo Silva, João Teixeira, Bruno Lima

Insurance application processes often rely on lengthy and standardized questionnaires that struggle to capture individual differences. Moreover, insurers must blindly trust users' responses, increasing the chances of fraud. The ARQuest framework introduces a new approach to underwriting by using Large Language Models (LLMs) and alternative data sources to create personalized and adaptive questionnaires. Techniques such as social media image analysis, geographic data categorization, and Retrieval Augmented Generation (RAG) are used to extract meaningful user insights and guide targeted follow-up questions. A life insurance system integrated into an industry partner mobile app was tested in two experiments. While traditional questionnaires yielded slightly higher accuracy in risk assessment, adaptive versions powered by GPT models required fewer questions and were preferred by users for their more fluid and engaging experience. ARQuest shows great potential to improve user satisfaction and streamline insurance processes. With further development, this approach may exceed traditional methods regarding risk accuracy and help drive innovation in the insurance industry.

摘要:保險申請過程通常依賴冗長且標準化的問卷,這些問卷難以捕捉個體差異。此外,保險公司必須盲目信任用戶的回答,這增加了詐騙的機會。ARQuest 框架通過使用大型語言模型(LLMs)和替代數據來源來創建個性化和自適應的問卷,為承保引入了一種新方法。使用社交媒體圖像分析、地理數據分類和檢索增強生成(RAG)等技術來提取有意義的用戶見解並指導針對性的後續問題。一個集成在行業合作夥伴移動應用中的人壽保險系統在兩個實驗中進行了測試。雖然傳統問卷在風險評估中產生的準確性略高,但由 GPT 模型驅動的自適應版本所需的問題較少,並且用戶更喜歡其更流暢和引人入勝的體驗。ARQuest 展示了改善用戶滿意度和簡化保險流程的巨大潛力。隨著進一步的發展,這種方法可能在風險準確性方面超越傳統方法,並幫助推動保險行業的創新。

Rare-Aware Autoencoding: Reconstructing Spatially Imbalanced Data

2604.02031v1 by Alejandro Castañeda Garcia, Jan van Gemert, Daan Brinks, Nergis Tömen

Autoencoders can be challenged by spatially non-uniform sampling of image content. This is common in medical imaging, biology, and physics, where informative patterns occur rarely at specific image coordinates because background dominates these locations in most samples, biasing reconstructions toward the majority appearance. In practice, autoencoders are biased toward dominant patterns, which results in the loss of fine-grained detail and blurred reconstructions for rare spatial inputs, especially under spatial data imbalance. We address spatial imbalance with two complementary components: (i) a self-entropy-based loss that upweights statistically uncommon spatial locations and (ii) Sample Propagation, a replay mechanism that selectively re-exposes the model to hard-to-reconstruct samples across batches during training. We benchmark existing data balancing strategies, originally developed for supervised classification, in the unsupervised reconstruction setting. Drawing on the limitations of these approaches, our method specifically targets spatial imbalance by encouraging models to focus on statistically rare locations, improving reconstruction consistency compared to existing baselines. We validate on a simulated dataset with controlled spatial imbalance conditions and on three diverse, uncontrolled real-world datasets spanning physical, biological, and astronomical domains. Our approach outperforms baselines on various reconstruction metrics, particularly under spatially imbalanced distributions. These results highlight the importance of how data are represented within a batch and of emphasizing rare samples in unsupervised image reconstruction. We will make all code and related data available.

摘要:自編碼器在圖像內容的空間不均勻取樣方面面臨挑戰。這在醫學影像、生物學和物理學中很常見,因為在特定的圖像坐標上,資訊性模式很少出現,背景在大多數樣本中主導這些位置,導致重建偏向於主要外觀。實際上,自編碼器對主導模式存在偏見,導致細節損失,並在稀有空間輸入下造成模糊的重建,特別是在空間數據不平衡的情況下。我們通過兩個互補的組件來解決空間不平衡問題:(i) 基於自熵的損失,對統計上不常見的空間位置進行加權,以及 (ii) 樣本傳播,一種重播機制,在訓練過程中選擇性地重新暴露模型於難以重建的樣本。 我們在無監督重建環境中基準測試了原本為監督分類開發的現有數據平衡策略。基於這些方法的局限性,我們的方法專門針對空間不平衡,鼓勵模型專注於統計上稀有的位置,與現有基準相比,提高重建的一致性。我們在具有控制空間不平衡條件的模擬數據集以及三個不受控的多樣化真實世界數據集中進行驗證,這些數據集涵蓋物理、生物和天文領域。我們的方法在各種重建指標上超越了基準,特別是在空間不平衡分佈下。這些結果突顯了批次中數據表示的重要性,並強調了無監督圖像重建中稀有樣本的價值。我們將提供所有代碼和相關數據。
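The abstract does not define the self-entropy-based loss, so the following is only one plausible reading, offered as a sketch. Each pixel is weighted by the self-information -log p of its (binarized) value at that location, estimated over the batch, so a rare foreground pixel at a mostly-background location counts more in the reconstruction error. All function names are hypothetical.

```python
import math
from typing import List


def self_information_weights(batch: List[List[int]], eps: float = 1e-6) -> List[List[float]]:
    """Weight each pixel by -log p(value at that location), estimated over the batch.

    batch[i][j] is 1 if location j of sample i holds foreground, else 0.
    Assumption: this binarized estimate stands in for the paper's loss term.
    """
    n = len(batch)
    n_loc = len(batch[0])
    p_active = [sum(x[j] for x in batch) / n for j in range(n_loc)]
    weights = []
    for x in batch:
        row = []
        for j, v in enumerate(x):
            p = p_active[j] if v == 1 else 1.0 - p_active[j]
            row.append(-math.log(max(p, eps)))
        weights.append(row)
    return weights


def weighted_mse(batch: List[List[int]], recon: List[List[float]],
                 weights: List[List[float]]) -> float:
    """Mean of weight * squared error over all pixels in the batch."""
    total, count = 0.0, 0
    for x, r, w in zip(batch, recon, weights):
        for xv, rv, wv in zip(x, r, w):
            total += wv * (xv - rv) ** 2
            count += 1
    return total / count


# One rare foreground pixel at location 0; location 1 is always background.
batch = [[1, 0], [0, 0], [0, 0], [0, 0]]
w = self_information_weights(batch)
blurry = [[0.25, 0.0]] * 4  # reconstruction that predicts the per-location mean
print(w[0][0], w[1][0], weighted_mse(batch, blurry, w))
```

In this toy batch the single foreground pixel gets the largest weight, so a reconstruction that simply predicts the per-location average is penalized most where the rare content sits.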

The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook

2604.02029v1 by Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Cheng Yang, Chengming Xu, Yue Ma, Xiaobin Hu, Zhe Cao, Jie Xu, Guibin Zhang, Jiale Tao, Jiayi Zhang, Siyuan Ma, Kaituo Feng, Haojie Huang, Youxing Li, Ronghao Chen, Huacan Wang, Chenglin Wu, Zikun Su, Xiaogang Xu, Kelu Yao, Kun Wang, Chen Gao, Yue Liao, Ruqi Huang, Tao Jin, Cheng Tan, Jiangning Zhang, Wenqi Ren, Yanwei Fu, Yong Liu, Yu Wang, Xiangyu Yue, Yu-Gang Jiang, Shuicheng Yan

Latent space is rapidly emerging as a native substrate for language-based models. While modern systems are still commonly understood through explicit token-level generation, an increasing body of work shows that many critical internal processes are more naturally carried out in continuous latent space than in human-readable verbal traces. This shift is driven by the structural limitations of explicit-space computation, including linguistic redundancy, discretization bottlenecks, sequential inefficiency, and semantic loss. This survey aims to provide a unified and up-to-date landscape of latent space in language-based models. We organize the survey into five sequential perspectives: Foundation, Evolution, Mechanism, Ability, and Outlook. We begin by delineating the scope of latent space, distinguishing it from explicit or verbal space and from the latent spaces commonly studied in generative visual models. We then trace the field's evolution from early exploratory efforts to the current large-scale expansion. To organize the technical landscape, we examine existing work through the complementary lenses of mechanism and ability. From the perspective of Mechanism, we identify four major lines of development: Architecture, Representation, Computation, and Optimization. From the perspective of Ability, we show how latent space supports a broad capability spectrum spanning Reasoning, Planning, Modeling, Perception, Memory, Collaboration, and Embodiment. Beyond consolidation, we discuss the key open challenges, and outline promising directions for future research. We hope this survey serves not only as a reference for existing work, but also as a foundation for understanding latent space as a general computational and systems paradigm for next-generation intelligence.

摘要:潛在空間正迅速成為基於語言的模型的原生基底。雖然現代系統仍然通常通過明確的標記級生成來理解,但越來越多的研究顯示,許多關鍵的內部過程在連續的潛在空間中比在人類可讀的語言痕跡中更自然地進行。這一轉變是由於明確空間計算的結構限制,包括語言冗餘、離散化瓶頸、序列效率低下和語義損失。這項調查旨在提供基於語言的模型中潛在空間的統一和最新的全景。我們將調查分為五個連續的視角:基礎、演變、機制、能力和展望。我們首先界定潛在空間的範疇,將其與明確或語言空間以及在生成視覺模型中常見的潛在空間區分開來。然後,我們追溯該領域從早期探索性努力到當前大規模擴展的演變。為了組織技術景觀,我們通過機制和能力的互補視角來檢視現有的工作。在機制的視角下,我們確定了四個主要的發展方向:架構、表示、計算和優化。在能力的視角下,我們展示了潛在空間如何支持跨越推理、規劃、建模、感知、記憶、協作和具身等廣泛能力範疇。除了整合之外,我們還討論了關鍵的未解挑戰,並概述了未來研究的有前景方向。我們希望這項調查不僅能作為現有工作的參考,還能作為理解潛在空間作為下一代智能的一般計算和系統範式的基礎。

Why Gaussian Diffusion Models Fail on Discrete Data?

2604.02028v1 by Alexander Shabalin, Simon Elistratov, Viacheslav Meshchaninov, Ildus Sadrtdinov, Dmitry Vetrov

Diffusion models have become a standard approach for generative modeling in continuous domains, yet their application to discrete data remains challenging. We investigate why Gaussian diffusion models with the DDPM solver struggle to sample from discrete distributions that are represented as a mixture of delta-distributions in the continuous space. Using a toy Random Hierarchy Model, we identify a critical sampling interval in which the density of noisified data becomes multimodal. In this regime, DDPM occasionally enters low-density regions between modes producing out-of-distribution inputs for the model and degrading sample quality. We show that existing heuristics, including self-conditioning and a solver we term q-sampling, help alleviate this issue. Furthermore, we demonstrate that combining self-conditioning with switching from DDPM to q-sampling within the critical interval improves generation quality on real data. We validate these findings across conditional and unconditional tasks in multiple domains, including text, programming code, and proteins.

摘要:擴散模型已成為連續領域生成建模的標準方法,但將其應用於離散數據仍然具有挑戰性。我們調查為什麼使用 DDPM 解算器的高斯擴散模型在從表示為連續空間中德爾塔分佈混合的離散分佈中進行取樣時會遇到困難。利用一個玩具隨機層次模型,我們確定了一個關鍵取樣區間,在該區間中,噪聲數據的密度變得多模態。在這種情況下,DDPM 偶爾會進入模式之間的低密度區域,產生模型的分佈外輸入,並降低樣本質量。我們展示了現有的啟發式方法,包括自我條件化和我們稱之為 q-取樣的解算器,有助於緩解此問題。此外,我們證明將自我條件化與在關鍵區間內從 DDPM 切換到 q-取樣相結合,可以改善真實數據的生成質量。我們在多個領域的條件和無條件任務中驗證了這些發現,包括文本、程式碼和蛋白質。
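A one-dimensional toy case shows how noised discrete data can be multimodal. It is not the paper's Random Hierarchy Model: take two equally likely data points at -1 and +1 and apply the standard DDPM forward marginal $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$. The noised density is then an equal-weight mixture of two Gaussians with means $\pm\sqrt{\bar\alpha_t}$ and variance $1-\bar\alpha_t$, which has two modes exactly when the means are more than two standard deviations apart, that is, when $\bar\alpha_t > 1/2$. The sketch below checks this numerically; the function names and grid settings are illustrative.

```python
import math


def noised_density(x: float, alpha_bar: float) -> float:
    """Density of x_t when x_0 is -1 or +1 with equal probability."""
    mean = math.sqrt(alpha_bar)
    var = 1.0 - alpha_bar
    norm = 1.0 / math.sqrt(2.0 * math.pi * var)

    def gauss(m: float) -> float:
        return norm * math.exp(-(x - m) ** 2 / (2.0 * var))

    return 0.5 * gauss(mean) + 0.5 * gauss(-mean)


def count_modes(alpha_bar: float, lo: float = -3.0, hi: float = 3.0,
                steps: int = 6001) -> int:
    """Count strict local maxima of the noised density on a uniform grid."""
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    ys = [noised_density(x, alpha_bar) for x in xs]
    return sum(1 for i in range(1, steps - 1) if ys[i - 1] < ys[i] > ys[i + 1])


for alpha_bar in (0.3, 0.8):
    print(alpha_bar, count_modes(alpha_bar))
```

In this toy setting a sampler that lands between the two modes while the density is bimodal sits in a low-density region, which is the kind of situation the abstract describes for the critical interval.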

ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety

2604.02022v1 by Yu Li, Haoyu Luo, Yuejin Xie, Yuqian Fu, Zhonghao Yang, Shuai Shao, Qihan Ren, Wanying Qu, Yanwei Fu, Yujiu Yang, Jing Shao, Xia Hu, Dongrui Liu

Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism. We introduce ATBench, a trajectory-level benchmark for structured, diverse, and realistic evaluation of agent safety. ATBench organizes agentic risk along three dimensions: risk source, failure mode, and real-world harm. Based on this taxonomy, we construct trajectories with heterogeneous tool pools and a long-context delayed-trigger protocol that captures realistic risk emergence across multiple stages. The benchmark contains 1,000 trajectories (503 safe and 497 unsafe), averaging 9.01 turns and 3.95k tokens, with 1,954 invoked tools drawn from pools spanning 2,084 available tools. Data quality is supported by rule-based and LLM-based filtering plus full human audit. Experiments on frontier LLMs, open-source models, and specialized guard systems show that ATBench is challenging even for strong evaluators, while enabling taxonomy-stratified analysis, cross-benchmark comparison, and diagnosis of long-horizon failure patterns.

摘要:評估基於 LLM 的代理的安全性變得越來越重要,因為在現實部署中,風險往往是在多步互動中出現,而不是孤立的提示或最終回應。現有的軌跡級基準受到互動多樣性不足、安全失敗的粗略可觀察性以及長期現實性較弱的限制。我們引入了 ATBench,一個用於結構化、多樣化和現實評估代理安全性的軌跡級基準。ATBench 從三個維度組織代理風險:風險來源、失敗模式和現實世界的傷害。基於這一分類法,我們構建了具有異質工具池和長上下文延遲觸發協議的軌跡,捕捉多個階段中現實風險的出現。該基準包含 1,000 條軌跡(503 條安全和 497 條不安全),平均 9.01 次回合和 3.95k 令牌,調用的工具數量為 1,954,來自 2,084 個可用工具的池中。數據質量由基於規則和基於 LLM 的過濾以及全面的人類審核支持。對前沿 LLM、開源模型和專門的防護系統的實驗表明,ATBench 對於強大的評估者來說也是具有挑戰性的,同時能夠實現分類法分層分析、跨基準比較和長期失敗模式的診斷。

Optimizing Interventions for Agent-Based Infectious Disease Simulations

2604.02016v1 by Anja Wolpers, Johannes Ponge, Adelinde M. Uhrmacher

Non-pharmaceutical interventions (NPIs) are commonly used tools for controlling infectious disease transmission when pharmaceutical options are unavailable. Yet, identifying effective interventions that minimize societal disruption remains challenging. Agent-based simulation is a popular tool for analyzing the impact of possible interventions in epidemiology. However, automatically optimizing NPIs using agent-based simulations poses a complex problem because, in agent-based epidemiological models, interventions can target individuals based on multiple attributes, affect hierarchical group structures (e.g., schools, workplaces, and families), and be combined arbitrarily, resulting in a very large or even infinite search space. We aim to support decision-makers with our Agent-based Infectious Disease Intervention Optimization System (ADIOS) that optimizes NPIs for infectious disease simulations using Grammar-Guided Genetic Programming (GGGP). The core of ADIOS is a domain-specific language for expressing NPIs in agent-based simulations that structures the intervention search space through a context-free grammar. To make optimization more efficient, the search space can be further reduced by defining constraints that prevent the generation of semantically invalid intervention patterns. Using this constrained language and an interface that enables coupling with agent-based simulations, ADIOS adopts the GGGP approach for simulation-based optimization. Using the German Epidemic Micro-Simulation System (GEMS) as a case study, we demonstrate the potential of our approach to generate optimal interventions for realistic epidemiological models.

摘要:非藥物干預措施(NPIs)是控制傳染病傳播的常用工具,尤其在藥物選擇不可用時。然而,識別能夠最小化社會干擾的有效干預措施仍然是一項挑戰。基於代理的模擬是一種流行的工具,用於分析可能干預措施在流行病學中的影響。然而,使用基於代理的模擬自動優化NPIs是一個複雜的問題,因為在基於代理的流行病學模型中,干預措施可以根據多個屬性針對個體,影響層級群體結構(例如,學校、工作場所和家庭),並且可以任意組合,這導致了非常大甚至無限的搜索空間。我們的目標是通過我們的基於代理的傳染病干預優化系統(ADIOS)來支持決策者,該系統使用語法引導的遺傳編程(GGGP)優化傳染病模擬中的NPIs。ADIOS的核心是一種特定領域語言,用於在基於代理的模擬中表達NPIs,通過上下文無關文法結構化干預搜索空間。為了提高優化效率,可以通過定義約束來進一步減少搜索空間,以防止生成語義上無效的干預模式。利用這種受約束的語言和一個使其能夠與基於代理的模擬耦合的介面,ADIOS採用了基於模擬的優化的GGGP方法。以德國流行病微模擬系統(GEMS)為案例,我們展示了我們的方法在生成現實流行病學模型的最佳干預措施方面的潛力。
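How a context-free grammar can structure an intervention search space is easiest to see with a toy example. The grammar below is entirely hypothetical, since the abstract does not give the actual ADIOS language. It derives random intervention strings with a depth limit, which is the kind of sampling step a grammar-guided genetic programming loop typically starts from.

```python
import random
from typing import Dict, List

# Hypothetical toy grammar; symbols in angle brackets are non-terminals.
# The first production of every rule is non-recursive, which the depth
# limit in derive() relies on to guarantee termination.
GRAMMAR: Dict[str, List[List[str]]] = {
    "<intervention>": [["<measure>"], ["<measure>", "and", "<intervention>"]],
    "<measure>": [["close", "<setting>"], ["isolate", "<group>"]],
    "<setting>": [["schools"], ["workplaces"]],
    "<group>": [["symptomatic_individuals"], ["household_members"]],
}


def derive(symbol: str, rng: random.Random, depth: int = 0,
           max_depth: int = 4) -> List[str]:
    """Expand a symbol into terminal tokens by randomly chosen productions."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal token
    options = GRAMMAR[symbol]
    if depth >= max_depth:
        options = [options[0]]  # force the non-recursive production
    production = rng.choice(options)
    tokens: List[str] = []
    for part in production:
        tokens.extend(derive(part, rng, depth + 1, max_depth))
    return tokens


rng = random.Random(0)
for _ in range(3):
    print(" ".join(derive("<intervention>", rng)))
```

Every derived string is syntactically valid by construction. Semantic constraints such as the ones the abstract mentions would be an extra filter on top of this, which the sketch does not attempt.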

$k$NNProxy: Efficient Training-Free Proxy Alignment for Black-Box Zero-Shot LLM-Generated Text Detection

2604.02008v1 by Kahim Wong, Kemou Li, Haiwei Wu, Jiantao Zhou

LLM-generated text (LGT) detection is essential for reliable forensic analysis and for mitigating LLM misuse. Existing LGT detectors can generally be categorized into two broad classes: learning-based approaches and zero-shot methods. Compared with learning-based detectors, zero-shot methods are particularly promising because they eliminate the need to train task-specific classifiers. However, the reliability of zero-shot methods fundamentally relies on the assumption that an off-the-shelf proxy LLM is well aligned with the often unknown source LLM, a premise that rarely holds in real-world black-box scenarios. To address this discrepancy, existing proxy alignment methods typically rely on supervised fine-tuning of the proxy or repeated interactions with commercial APIs, thereby increasing deployment costs, exposing detectors to silent API changes, and limiting robustness under domain shift. Motivated by these limitations, we propose the $k$-nearest neighbor proxy ($k$NNProxy), a training-free and query-efficient proxy alignment framework that repurposes the $k$NN language model ($k$NN-LM) retrieval mechanism as a domain adapter for a fixed proxy LLM. Specifically, a lightweight datastore is constructed once from a target-reflective LGT corpus, either via fixed-budget querying or from existing datasets. During inference, nearest-neighbor evidence induces a token-level predictive distribution that is interpolated with the proxy output, yielding an aligned prediction without proxy fine-tuning or per-token API outputs. To improve robustness under domain shift, we extend $k$NNProxy into a mixture of proxies (MoP) that routes each input to a domain-specific datastore for domain-consistent retrieval. Extensive experiments demonstrate strong detection performance of our method.

摘要:LLM 生成文本 (LGT) 偵測對於可靠的法醫分析和減少 LLM 誤用至關重要。現有的 LGT 偵測器通常可以分為兩大類:基於學習的方法和零-shot 方法。與基於學習的偵測器相比,零-shot 方法特別有前景,因為它們消除了訓練特定任務分類器的需要。然而,零-shot 方法的可靠性基本上依賴於一個現成的代理 LLM 與通常未知的源 LLM 之間的良好對齊,這一前提在現實世界的黑箱場景中很少成立。為了解決這一差異,現有的代理對齊方法通常依賴於對代理的監督微調或與商業 API 的重複互動,從而增加了部署成本,使偵測器面臨靜默 API 變更的風險,並限制了在領域轉移下的穩健性。受到這些限制的啟發,我們提出了 $k$-最近鄰代理 ($k$NNProxy),這是一個無需訓練且查詢效率高的代理對齊框架,將 $k$NN 語言模型 ($k$NN-LM) 檢索機制重新用作固定代理 LLM 的領域適配器。具體而言,從目標反映的 LGT 語料庫中構建一個輕量級數據存儲,無論是通過固定預算查詢還是從現有數據集獲得。在推理過程中,最近鄰證據會產生一個基於標記的預測分佈,該分佈與代理輸出進行插值,從而產生對齊的預測,而無需對代理進行微調或每個標記的 API 輸出。為了提高在領域轉移下的穩健性,我們將 $k$NNProxy 擴展為一個代理混合體 (MoP),將每個輸入路由到特定領域的數據存儲以進行領域一致的檢索。廣泛的實驗表明我們的方法具有強大的偵測性能。
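The retrieval-then-interpolation step follows the general $k$NN-LM recipe, which can be sketched in a few lines: retrieved neighbors vote for their stored next token with a softmax over negative distances, and the result is mixed with the proxy model's distribution using a weight $\lambda$. The distances, tokens, temperature and $\lambda$ below are made up for illustration, and the datastore lookup itself is omitted.

```python
import math
from typing import Dict, List, Tuple


def knn_distribution(neighbors: List[Tuple[str, float]],
                     temperature: float = 1.0) -> Dict[str, float]:
    """Softmax over negative distances, summed per stored next token."""
    scores = [math.exp(-dist / temperature) for _, dist in neighbors]
    z = sum(scores)
    result: Dict[str, float] = {}
    for (token, _), score in zip(neighbors, scores):
        result[token] = result.get(token, 0.0) + score / z
    return result


def interpolate(p_proxy: Dict[str, float], p_knn: Dict[str, float],
                lam: float) -> Dict[str, float]:
    """Return lam * p_knn + (1 - lam) * p_proxy over the union vocabulary."""
    vocab = set(p_proxy) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1.0 - lam) * p_proxy.get(t, 0.0)
            for t in vocab}


# Hypothetical retrieved neighbors: (next token stored in the datastore, distance).
neighbors = [("moreover", 0.2), ("however", 0.9), ("moreover", 0.4)]
p_proxy = {"however": 0.6, "moreover": 0.3, "but": 0.1}
aligned = interpolate(p_proxy, knn_distribution(neighbors), lam=0.3)
print(aligned)
```

A zero-shot detector would then score text under `aligned` instead of the raw proxy distribution. How the paper sets $\lambda$ and builds the datastore is not stated in the abstract.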

ProCeedRL: Process Critic with Exploratory Demonstration Reinforcement Learning for LLM Agentic Reasoning

2604.02006v1 by Jingyue Gao, Yanjiang Guo, Xiaoshuai Chen, Jianyu Chen

Reinforcement Learning (RL) significantly enhances the reasoning abilities of large language models (LLMs), yet applying it to multi-turn agentic tasks remains challenging due to the long-horizon nature of interactions and the stochasticity of environmental feedback. We identify a structural failure mode in agentic exploration: suboptimal actions elicit noisy observations into misleading contexts, which further weaken subsequent decision-making, making recovery increasingly difficult. This cumulative feedback loop of errors renders standard exploration strategies ineffective and susceptible to the model's reasoning and the environment's randomness. To mitigate this issue, we propose ProCeedRL: Process Critic with Exploratory Demonstration RL, shifting exploration from passive selection to active intervention. ProCeedRL employs a process-level critic to monitor interactions in real time, incorporating reflection-based demonstrations to guide agents in stopping the accumulation of errors. We find that this approach significantly exceeds the model's saturated exploration performance, demonstrating substantial exploratory benefits. By learning from exploratory demonstrations and on-policy samples, ProCeedRL significantly improves exploration efficiency and achieves superior performance on complex deep search and embodied tasks.

摘要:強化學習(RL)顯著提升了大型語言模型(LLMs)的推理能力,然而,由於互動的長期性質和環境反饋的隨機性,將其應用於多回合的代理任務仍然具有挑戰性。我們識別出代理探索中的一種結構性失敗模式:次優行為引發的噪聲觀察進入誤導性上下文,進一步削弱後續的決策,使得恢復變得越來越困難。這種錯誤的累積反饋循環使得標準探索策略無效且容易受到模型推理和環境隨機性的影響。為了減輕這一問題,我們提出了ProCeedRL:帶有探索性示範的過程評價強化學習,將探索從被動選擇轉變為主動干預。ProCeedRL使用過程級別的評價器實時監控互動,結合基於反思的示範來指導代理停止錯誤的累積。我們發現這種方法顯著超過了模型的飽和探索性能,展示了可觀的探索性收益。通過從探索性示範和政策樣本中學習,ProCeedRL顯著提高了探索效率,並在複雜的深度搜索和具身任務上達到了卓越的表現。

How and why does deep ensemble coupled with transfer learning increase performance in bipolar disorder and schizophrenia classification?

2604.02002v1 by Sara Petiton, Antoine Grigis, Benoit Dufumier, Edouard Duchesnay

Transfer learning (TL) and deep ensemble learning (DE) have recently been shown to outperform simple machine learning in classifying psychiatric disorders. However, there is still a lack of understanding as to why that is. This paper aims to understand how and why DE and TL reduce the variability of single-subject classification models in bipolar disorder (BD) and schizophrenia (SCZ). To this end, we investigated the training stability of TL and DE models. For the two classification tasks under consideration, we compared the results of multiple trainings with the same backbone but with different initializations. In this way, we take into account the epistemic uncertainty associated with the uncertainty in the estimation of the model parameters. It has been shown that the performance of classifiers can be significantly improved by using TL with DE. Based on these results, we investigate i) how many models are needed to benefit from the performance improvement of DE when classifying BD and SCZ from healthy controls, and ii) how TL induces better generalization, with and without DE. In the first case, we show that DE reaches a plateau when 10 models are included in the ensemble. In the second case, we find that using a pre-trained model constrains TL models with the same pre-training to stay in the same basin of the loss function. This is not the case for DL models with randomly initialized weights.

摘要:轉移學習(TL)和深度集成學習(DE)最近已被證明在分類精神疾病方面優於簡單的機器學習。然而,對於為什麼會這樣仍然缺乏理解。本論文旨在了解 DE 和 TL 如何以及為何減少雙相情感障礙(BD)和精神分裂症(SCZ)中單一受試者分類模型的變異性。為此,我們調查了 TL 和 DE 模型的訓練穩定性。對於考慮的兩個分類任務,我們比較了使用相同骨幹但不同初始化的多次訓練結果。這樣,我們考慮了與模型參數估計不確定性相關的認識不確定性。研究表明,通過使用 TL 與 DE,分類器的性能可以顯著提高。基於這些結果,我們調查 i) 在從健康對照中分類 BD 和 SCZ 時,需要多少模型才能受益於 DE 的性能提升,以及 ii) TL 如何在有和沒有 DE 的情況下促進更好的泛化。在第一種情況下,我們顯示當集成中包含 10 個模型時,DE 達到了平臺。在第二種情況下,我們發現使用預訓練模型限制了具有相同預訓練的 TL 模型保持在損失函數的同一盆地中。隨機初始化權重的 DL 模型則不是這樣。
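A deep ensemble in its simplest form averages the class probabilities of models trained from different initializations. The sketch below assumes this plain averaging rule and uses invented probabilities; whether the paper combines members exactly this way is not stated in the abstract.

```python
from typing import List


def ensemble_predict(member_probs: List[List[float]]) -> List[float]:
    """Average class probabilities over ensemble members for one subject."""
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    return [sum(p[c] for p in member_probs) / n_members
            for c in range(n_classes)]


# Hypothetical outputs of three differently initialized models for one subject,
# as [P(control), P(patient)].
members = [[0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
print(ensemble_predict(members))
```

Individual members can disagree, as the second one does here, while the averaged prediction changes little when one more member is added. That is consistent with the plateau the abstract reports at around 10 models.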

GenGait: A Transformer-Based Model for Human Gait Anomaly Detection and Normative Twin Generation

2604.01997v1 by Elisa Motta, Marta Lorenzini, Clara Mouawad, Alberto Ranavolo, Mariano Serrao, Arash Ajoudani

Gait analysis provides an objective characterization of locomotor function and is widely used to support diagnosis and rehabilitation monitoring across neurological and orthopedic disorders. Deep learning has been increasingly applied to this domain, yet most approaches rely on supervised classifiers trained on disease-labeled data, limiting generalization to heterogeneous pathological presentations. This work proposes a label-free framework for joint-level anomaly detection and kinematic correction based on a Transformer masked autoencoder trained exclusively on normative gait sequences from 150 adults, acquired with a markerless multi-camera motion-capture system. At inference, a two-pass procedure is applied to potentially pathological input sequences. First, it estimates joint inconsistency scores by occluding individual joints and measuring deviations from the learned normative prior. Then, it withholds the flagged joints from the encoder input and reconstructs the full skeleton from the remaining spatiotemporal context, yielding corrected kinematic trajectories at the flagged positions. Validation on 10 held-out normative participants, who mimicked seven simulated gait abnormalities, showed accurate localization of biomechanically inconsistent joints, a significant reduction in angular deviation across all analyzed joints with large effect sizes, and preservation of normative kinematics. The proposed approach enables interpretable, subject-specific localization of gait impairments without requiring disease labels. Video is available at https://youtu.be/Rcm3jqR5pN4.

摘要:步態分析提供了對運動功能的客觀描述,並廣泛用於支持神經和骨科疾病的診斷及康復監測。深度學習在這一領域的應用日益增多,但大多數方法依賴於在疾病標記數據上訓練的監督分類器,這限制了對異質病理表現的泛化。本文提出了一種無標籤的框架,用於基於僅在150名成年人標準步態序列上訓練的Transformer遮罩自編碼器進行關節級異常檢測和運動學修正,這些數據是通過無標記的多攝像頭運動捕捉系統獲得的。在推斷過程中,對潛在病理的輸入序列應用兩次通過的程序,首先通過遮蔽單個關節來估計關節不一致性得分,並測量與學習的標準先驗的偏差。然後,它從編碼器輸入中排除被標記的關節,並從剩餘的時空上下文重建完整的骨架,從而在被標記的位置產生修正的運動學軌跡。在10名被排除的標準參與者上進行的驗證顯示,這些參與者模擬了七種步態異常,準確定位了生物力學不一致的關節,所有分析關節的角度偏差顯著降低,且效果大小較大,同時保留了標準運動學。所提出的方法使得在不需要疾病標籤的情況下,能夠對步態障礙進行可解釋的、特定於個體的定位。視頻可在 https://youtu.be/Rcm3jqR5pN4 獲得。
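The two-pass inference procedure can be sketched independently of the actual model. In the toy version below the trained masked autoencoder is replaced by a stand-in that fills hidden joints with the mean of the visible ones, and a pose is a single frame of made-up values. The joint names, threshold and `mean_of_visible` helper are all hypothetical.

```python
from typing import Callable, Dict, Set, Tuple

Pose = Dict[str, float]                          # joint name -> value (one frame)
Reconstructor = Callable[[Pose, Set[str]], Pose]  # (pose, hidden joints) -> full pose


def two_pass(pose: Pose, reconstruct: Reconstructor,
             threshold: float) -> Tuple[Dict[str, float], Set[str], Pose]:
    """Pass 1 scores each joint while it is hidden; pass 2 rebuilds flagged joints."""
    scores = {}
    for joint in pose:
        predicted = reconstruct(pose, {joint})
        scores[joint] = abs(predicted[joint] - pose[joint])
    flagged = {j for j, s in scores.items() if s > threshold}

    corrected = dict(pose)
    if flagged:
        predicted = reconstruct(pose, flagged)  # all flagged joints hidden together
        for joint in flagged:
            corrected[joint] = predicted[joint]
    return scores, flagged, corrected


def mean_of_visible(pose: Pose, hidden: Set[str]) -> Pose:
    """Stand-in for the trained model: fill hidden joints with the visible mean."""
    visible = [v for j, v in pose.items() if j not in hidden]
    fill = sum(visible) / len(visible)
    return {j: (fill if j in hidden else v) for j, v in pose.items()}


pose = {"joint_a": 10.0, "joint_b": 12.0, "joint_c": 11.0, "joint_d": 40.0}
scores, flagged, corrected = two_pass(pose, mean_of_visible, threshold=20.0)
print(flagged, corrected)
```

The point of the second pass is that flagged joints are excluded from the input, so an abnormal joint cannot contaminate its own correction.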

SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning

2604.01993v1 by Daeyong Kwon, Soyoung Yoon, Seung-won Hwang

Multi-hop QA benchmarks frequently reward Large Language Models (LLMs) for spurious correctness, masking ungrounded or flawed reasoning steps. To shift toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces the ungrounded Chain-of-Thought (CoT) with a strictly verifiable sequence of grounded entities. Our framework operates across two phases: (1) train-time verification, where we establish an atomic error taxonomy and a Knowledge Graph (KG)-grounded verification pipeline to eliminate noisy supervision in standard benchmarks, identifying up to 14% of instances as unanswerable, and (2) inference-time verification, where a feedback model trained on this verified dataset dynamically detects ungrounded steps in real-time. Experimental results demonstrate that SAFE not only exposes the critical flaws of existing benchmarks at train-time, but also significantly outperforms standard baselines, achieving an average accuracy gain of 8.4 pp while guaranteeing verifiable trajectories at inference-time.

摘要:多跳 QA 基準常常因表面正確性而獎勵大型語言模型(LLMs),掩蓋了未經證實或有缺陷的推理步驟。為了轉向嚴謹的推理,我們提出了 SAFE,一個動態基準框架,將未經證實的思維鏈(CoT)替換為一個嚴格可驗證的基於實體的序列。 我們的框架分為兩個階段運作: (1) 訓練時驗證,我們建立了一個原子錯誤分類法和一個基於知識圖譜(KG)的驗證管道,以消除標準基準中的噪聲監督,並將多達 14% 的實例識別為無法回答, (2) 推理時驗證,訓練於這個經過驗證的數據集的反饋模型能夠實時動態檢測未經證實的步驟。 實驗結果顯示,SAFE 不僅在訓練時揭示了現有基準的關鍵缺陷,還顯著超越了標準基準,實現了平均準確率提升 8.4 個百分點,同時在推理時保證可驗證的軌跡。
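Checking a reasoning trajectory as a sequence of grounded entities can be sketched as follows. It is a minimal illustration under two assumptions: each step is a (head, relation, tail) triple, and a step is grounded when the triple exists in the knowledge graph and its head matches the previous step's tail. The toy graph and function name are hypothetical, and the paper's error taxonomy is richer than this single check.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


def first_ungrounded_step(steps: List[Triple], kg: Set[Triple]) -> int:
    """Return the index of the first step that fails verification, or -1.

    A step passes when its triple is in the knowledge graph and its head
    entity equals the tail entity of the previous step, so the hops chain.
    """
    previous_tail = None
    for i, (head, relation, tail) in enumerate(steps):
        if (head, relation, tail) not in kg:
            return i
        if previous_tail is not None and head != previous_tail:
            return i
        previous_tail = tail
    return -1


# Toy knowledge graph and two candidate trajectories.
kg = {
    ("Inception", "directed_by", "Christopher Nolan"),
    ("Christopher Nolan", "born_in", "London"),
}
grounded = [("Inception", "directed_by", "Christopher Nolan"),
            ("Christopher Nolan", "born_in", "London")]
flawed = [("Inception", "directed_by", "Christopher Nolan"),
          ("Christopher Nolan", "born_in", "Paris")]
print(first_ungrounded_step(grounded, kg), first_ungrounded_step(flawed, kg))
```

Returning the index of the failing step, not just a yes/no verdict, is what makes stepwise feedback possible: a model could be asked to redo that hop alone.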

Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation

2604.01989v1 by Boyang Gong, Yu Zheng, Fanye Kong, Jie Zhou, Jiwen Lu

Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We thereby propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.

摘要:像靜止的物體保持靜止一樣,我們發現多模態大型語言模型(MLLMs)中的視覺注意力顯示出明顯的慣性,一旦在早期解碼步驟中穩定下來,便大致保持靜態,無法支持認知推理所需的組合理解。雖然現有的幻覺緩解方法主要針對與物體存在或屬性相關的感知幻覺,但對於需要物體間關係推理的認知幻覺仍然不足。通過逐個標記的注意力分析,我們將這種視覺慣性確定為一個關鍵因素:對語義關鍵區域的注意力持續集中,無法動態支持關係推理。因此,我們提出了一種無需訓練的慣性感知視覺激勵(IVE)方法,通過將認知推理建模為視覺注意力的動態響應來打破這種慣性模式。具體而言,IVE選擇相對於歷史注意力趨勢動態出現的視覺標記,同時區分表現出慣性行為的標記。為了進一步促進組合推理,IVE引入了一種慣性感知懲罰,旨在抑制過度集中並限制注意力在局部區域內的持續性。大量實驗表明,IVE在各種基礎MLLM和多個幻覺基準上均有效,特別是對於認知幻覺。

SenseMath: Do LLMs Have Number Sense? Evaluating Shortcut Use, Judgment, and Generation

2604.01988v1 by Haomin Zhuang, Xiangqi Wang, Yili Shen, Ying Cheng, Xiangliang Zhang

Large language models often default to step-by-step computation even when efficient numerical shortcuts are available. This raises a basic question: do they exhibit number sense in a human-like behavioral sense, i.e., the ability to recognize numerical structure, apply shortcuts when appropriate, and avoid them when they are not? We introduce SenseMath, a controlled benchmark for evaluating structure-sensitive numerical reasoning in LLMs. SenseMath contains 4,800 items spanning eight shortcut categories and four digit scales, with matched strong-shortcut, weak-shortcut, and control variants. It supports three evaluation settings of increasing cognitive demand: Shortcut Use (whether models can apply shortcuts on shortcut-amenable problems); Applicability Judgment (whether they can recognize when a shortcut is appropriate or misleading); and Problem Generation (whether they can generate new problem items that correctly admit a given type of shortcut). Our evaluation across five LLMs, ranging from GPT-4o-mini to Llama-3.1-8B, shows a consistent pattern: when explicitly prompted, models readily adopt shortcut strategies and achieve substantial accuracy gains on shortcut-amenable items (up to 15%), yet under standard chain-of-thought prompting they spontaneously employ such strategies in fewer than 40% of cases, even when they demonstrably possess the requisite capability. Moreover, this competence is confined to the Use level; models systematically over-generalise shortcuts to problems where they do not apply, and fail to generate valid shortcut-bearing problems from scratch. Together, these results suggest that current LLMs exhibit procedural shortcut fluency without the structural understanding of when and why shortcuts work that underlies human number sense.

摘要:大型語言模型在有高效數值捷徑可用的情況下,經常默認採用逐步計算。這引發了一個基本問題:它們是否在類人行為意義上展現了數字感,即識別數字結構的能力、在適當時應用捷徑的能力,以及在不適當時避免使用捷徑的能力?我們介紹了SenseMath,一個用於評估大型語言模型中結構敏感數值推理的控制基準。SenseMath包含4800個項目,涵蓋八個捷徑類別和四個數字範疇,並配有匹配的強捷徑、弱捷徑和控制變體。它支持三種逐漸增加認知需求的評估設置:捷徑使用(模型是否能在適合捷徑的問題上應用捷徑);適用性判斷(它們是否能識別何時捷徑是合適的或具誤導性的);以及問題生成(它們是否能生成正確允許給定類型捷徑的新問題項目)。我們對五個大型語言模型的評估,從GPT-4o-mini到Llama-3.1-8B,顯示出一致的模式:當被明確提示時,模型會輕易採用捷徑策略,並在適合捷徑的項目上實現顯著的準確性提升(高達15%),然而在標準的思維鏈提示下,它們自發地在不到40%的情況下使用這些策略,即使它們顯然具備所需的能力。此外,這種能力僅限於使用層級;模型系統性地將捷徑過度泛化到不適用的問題上,並無法從零開始生成有效的含捷徑的問題。綜合這些結果表明,目前的大型語言模型展現了程序性捷徑流利性,但缺乏人類數字感所需的對捷徑何時及為何有效的結構理解。

World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry

2604.01985v1 by Yuejiang Liu, Fan Feng, Lingjing Kong, Weifeng Lu, Jinzhou Tang, Kun Zhang, Kevin Murphy, Chelsea Finn, Yilun Du

General-purpose world models promise scalable policy evaluation, optimization, and planning, yet achieving the required level of robustness remains challenging. Unlike policy learning, which primarily focuses on optimal actions, a world model must be reliable over a much broader range of suboptimal actions, which are often insufficiently covered by action-labeled interaction data. To address this challenge, we propose World Action Verifier (WAV), a framework that enables world models to identify their own prediction errors and self-improve. The key idea is to decompose action-conditioned state prediction into two factors -- state plausibility and action reachability -- and verify each separately. We show that these verification problems can be substantially easier than predicting future states due to two underlying asymmetries: the broader availability of action-free data and the lower dimensionality of action-relevant features. Leveraging these asymmetries, we augment a world model with (i) a diverse subgoal generator obtained from video corpora and (ii) a sparse inverse model that infers actions from a subset of state features. By enforcing cycle consistency among generated subgoals, inferred actions, and forward rollouts, WAV provides an effective verification mechanism in under-explored regimes, where existing methods typically fail. Across nine tasks spanning MiniGrid, RoboMimic, and ManiSkill, our method achieves 2x higher sample efficiency while improving downstream policy performance by 18%.

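Summary: General-purpose world models promise scalable policy evaluation, optimization, and planning, but achieving the required robustness remains challenging. Unlike policy learning, which focuses mainly on optimal actions, a world model must stay reliable over a much broader range of suboptimal actions, which are often poorly covered by action-labeled interaction data. To address this challenge, we propose World Action Verifier (WAV), a framework that lets world models identify their own prediction errors and self-improve. The key idea is to decompose action-conditioned state prediction into two factors, state plausibility and action reachability, and to verify each separately. We show that these verification problems can be substantially easier than predicting future states because of two underlying asymmetries: the broader availability of action-free data and the lower dimensionality of action-relevant features. Leveraging these asymmetries, we augment a world model with (i) a diverse subgoal generator obtained from video corpora and (ii) a sparse inverse model that infers actions from a subset of state features. By enforcing cycle consistency among generated subgoals, inferred actions, and forward rollouts, WAV provides an effective verification mechanism in under-explored regimes, where existing methods typically fail. Across nine tasks spanning MiniGrid, RoboMimic, and ManiSkill, our method achieves 2x higher sample efficiency while improving downstream policy performance by 18%.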
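
The cycle-consistency check described in this entry can be sketched on a toy problem. This is a minimal illustration, not the WAV implementation: the one-dimensional integer state, the names `inverse_model`, `learned_forward_model`, and `cycle_error`, and the saturating error pattern of the forward model are all assumptions made for the example.

```python
def inverse_model(state: int, subgoal: int) -> int:
    """Hypothetical inverse model: infer the action that moves state to subgoal."""
    return subgoal - state


def learned_forward_model(state: int, action: int) -> int:
    """Hypothetical imperfect world model.

    Assumed accurate for small actions (|action| <= 2) and saturating
    for larger, under-explored actions.
    """
    clipped = max(-2, min(2, action))
    return state + clipped


def cycle_error(state: int, subgoal: int) -> int:
    """Subgoal -> inferred action -> forward rollout -> distance to the subgoal."""
    action = inverse_model(state, subgoal)
    predicted = learned_forward_model(state, action)
    return abs(predicted - subgoal)


if __name__ == "__main__":
    for goal in (1, 2, 5):
        # A nonzero error marks a transition the forward model gets wrong.
        print(goal, cycle_error(0, goal))
```

In this toy setting a nonzero cycle error flags a transition where the forward rollout and the inferred action disagree with the subgoal, which is the kind of signal a verifier could use to pick data for further training.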

RuleForge: Automated Generation and Validation for Web Vulnerability Detection at Scale

2604.01977v1 by Ayush Garg, Sophia Hager, Jacob Montiel, Aditya Tiwari, Michael Gentile, Zach Reavis, David Magnotti, Wayne Fullen

Security teams face a challenge: the volume of newly disclosed Common Vulnerabilities and Exposures (CVEs) far exceeds the capacity to manually develop detection mechanisms. In 2025, the National Vulnerability Database published over 48,000 new vulnerabilities, motivating the need for automation. We present RuleForge, an AWS internal system that automatically generates detection rules--JSON-based patterns that identify malicious HTTP requests exploiting specific vulnerabilities--from structured Nuclei templates describing CVE details. Nuclei templates provide standardized, YAML-based vulnerability descriptions that serve as the structured input for our rule generation process. This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic feedback integration mechanism. This validation approach evaluates candidate rules across two dimensions--sensitivity (avoiding false negatives) and specificity (avoiding false positives)--achieving AUROC of 0.75 and reducing false positives by 67% compared to synthetic-test-only validation in production. Our 5x5 generation strategy (five parallel candidates with up to five refinement attempts each) combined with continuous feedback loops enables systematic quality improvement. We also present extensions enabling rule generation from unstructured data sources and demonstrate a proof-of-concept agentic workflow for multi-event-type detection. Our lessons learned highlight critical considerations for applying LLMs to cybersecurity tasks, including overconfidence mitigation and the importance of domain expertise in both prompt design and quality review of generated rules through human-in-the-loop validation.

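Summary: Security teams face a challenge: the volume of newly disclosed Common Vulnerabilities and Exposures (CVEs) far exceeds the capacity to develop detection mechanisms manually. In 2025, the National Vulnerability Database published over 48,000 new vulnerabilities, motivating the need for automation. We present RuleForge, an AWS internal system that automatically generates detection rules (JSON-based patterns that identify malicious HTTP requests exploiting specific vulnerabilities) from structured Nuclei templates describing CVE details. Nuclei templates provide standardized, YAML-based vulnerability descriptions that serve as the structured input to our rule generation process.
This paper focuses on RuleForge's architecture and operational deployment for CVE-related threat detection, with particular emphasis on our novel LLM-as-a-judge (Large Language Model as judge) confidence validation system and systematic feedback integration mechanism. This validation approach evaluates candidate rules along two dimensions, sensitivity (avoiding false negatives) and specificity (avoiding false positives), achieving an AUROC of 0.75 and reducing false positives in production by 67% compared to synthetic-test-only validation. Our 5x5 generation strategy (five parallel candidates, each with up to five refinement attempts), combined with continuous feedback loops, enables systematic quality improvement. We also present extensions that enable rule generation from unstructured data sources and demonstrate a proof-of-concept agentic workflow for multi-event-type detection. Our lessons learned highlight key considerations for applying LLMs to cybersecurity tasks, including overconfidence mitigation and the importance of domain expertise in both prompt design and quality review of generated rules through human-in-the-loop validation.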
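
The 5x5 generation strategy (five candidates, each with up to five refinement attempts) can be sketched as a small control loop. This is only an illustration of the loop structure under stated assumptions: the callbacks `generate`, `refine`, and `passes` are hypothetical stand-ins for the LLM generation, refinement, and validation steps, and candidates are processed sequentially here rather than in parallel.

```python
from typing import Callable, List, TypeVar

Rule = TypeVar("Rule")


def five_by_five(
    generate: Callable[[int], Rule],
    refine: Callable[[Rule], Rule],
    passes: Callable[[Rule], bool],
    n_candidates: int = 5,
    max_refinements: int = 5,
) -> List[Rule]:
    """Return every candidate that passes validation within the refinement budget."""
    accepted: List[Rule] = []
    for index in range(n_candidates):
        rule = generate(index)
        attempts = 0
        # Refine only while the rule fails and the budget is not exhausted.
        while not passes(rule) and attempts < max_refinements:
            rule = refine(rule)
            attempts += 1
        if passes(rule):
            accepted.append(rule)
    return accepted


if __name__ == "__main__":
    # Toy stand-ins: a "rule" is an integer and validation is a threshold.
    print(five_by_five(lambda i: i, lambda r: r + 1, lambda r: r >= 7))
```

With the toy callbacks, candidates starting at 0 and 1 exhaust their five refinements without passing, so only three candidates are accepted.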

Ego-Grounding for Personalized Question-Answering in Egocentric Videos

2604.01966v1 by Junbin Xiao, Shenglang Zhang, Pengxiang Zhu, Angela Yao

We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering requiring ego-grounding - the ability to understand the camera-wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 long videos and 5K personalized questions asking about "my things", "my activities", and "my past". Benchmarking reveals that competitive MLLMs across variants, including open-source vs. proprietary, thinking vs. non-thinking, and small vs. large scales, all struggle on MyEgo. Top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) achieve only about 46% and 36% accuracy, trailing human performance by nearly 40% and 50%, respectively. Surprisingly, neither explicit reasoning nor model scaling yields consistent improvements. Models improve when relevant evidence is explicitly provided, but gains drop over time, indicating limitations in tracking and remembering "me" and "my past". These findings collectively highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance. Data and code are available at https://github.com/Ryougetsu3606/MyEgo

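Summary: We present the first systematic analysis of multimodal large language models (MLLMs) on personalized question answering that requires ego-grounding, the ability to understand the camera wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 long videos and 5K personalized questions about "my things", "my activities", and "my past". Benchmarking shows that competitive MLLMs of all kinds (open-source or proprietary, thinking or non-thinking, small or large) struggle on MyEgo. Top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) reach only about 46% and 36% accuracy, trailing human performance by nearly 40% and 50%, respectively. Surprisingly, neither explicit reasoning nor model scaling yields consistent improvements. Models improve when relevant evidence is explicitly provided, but the gains drop over time, indicating limitations in tracking and remembering "me" and "my past". Together, these findings highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance. Data and code are available at https://github.com/Ryougetsu3606/MyEgo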

Do We Need Bigger Models for Science? Task-Aware Retrieval with Small Language Models

2604.01965v1 by Florian Kelber, Matthias Jobst, Yuni Susanti, Michael Färber

Scientific knowledge discovery increasingly relies on large language models, yet many existing scholarly assistants depend on proprietary systems with tens or hundreds of billions of parameters. Such reliance limits reproducibility and accessibility for the research community. In this work, we ask a simple question: do we need bigger models for scientific applications? Specifically, we investigate to what extent carefully designed retrieval pipelines can compensate for reduced model scale in scientific applications. We design a lightweight retrieval-augmented framework that performs task-aware routing to select specialized retrieval strategies based on the input query. The system further integrates evidence from full-text scientific papers and structured scholarly metadata, and employs compact instruction-tuned language models to generate responses with citations. We evaluate the framework across several scholarly tasks, focusing on scholarly question answering (QA), including single- and multi-document scenarios, as well as biomedical QA under domain shift and scientific text compression. Our findings demonstrate that retrieval and model scale are complementary rather than interchangeable. While retrieval design can partially compensate for smaller models, model capacity remains important for complex reasoning tasks. This work highlights retrieval and task-aware design as key factors for building practical and reproducible scholarly assistants.

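Summary: Scientific knowledge discovery increasingly relies on large language models, yet many existing scholarly assistants depend on proprietary systems with tens or hundreds of billions of parameters. This reliance limits reproducibility and accessibility for the research community. In this work, we ask a simple question: do we need bigger models for scientific applications? Specifically, we investigate how far carefully designed retrieval pipelines can compensate for reduced model scale in scientific applications. We design a lightweight retrieval-augmented framework that performs task-aware routing to select specialized retrieval strategies based on the input query. The system further integrates evidence from full-text scientific papers and structured scholarly metadata, and uses compact instruction-tuned language models to generate responses with citations. We evaluate the framework on several scholarly tasks, focusing on scholarly question answering (QA) in single- and multi-document scenarios, as well as biomedical QA under domain shift and scientific text compression. Our findings show that retrieval and model scale are complementary rather than interchangeable. While retrieval design can partly compensate for smaller models, model capacity remains important for complex reasoning tasks. This work highlights retrieval and task-aware design as key factors in building practical and reproducible scholarly assistants.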

Abnormal Head Movements in Neurological Conditions: A Knowledge-Based Dataset with Application to Cervical Dystonia

2604.01962v1 by Saja Al-Dabet, Sherzod Turaev, Nazar Zaki

Abnormal head movements (AHMs) manifest across a broad spectrum of neurological disorders; however, the absence of a multi-condition resource integrating kinematic measurements, clinical severity scores, and patient demographics constitutes a persistent barrier to the development of AI-driven diagnostic tools. To address this gap, this study introduces NeuroPose-AHM, a knowledge-based dataset of neurologically induced AHMs constructed through a multi-LLM extraction framework applied to 1,430 peer-reviewed publications. The dataset contains 2,756 patient-group-level records spanning 57 neurological conditions, derived from 846 AHM-relevant papers. Inter-LLM reliability analysis confirms robust extraction performance, with study-level classification achieving strong agreement (kappa = 0.822). To demonstrate the dataset's analytical utility, a four-task framework is applied to cervical dystonia (CD), the condition most directly defined by pathological head movement. First, Task 1 performs multi-label AHM type classification (F1 = 0.856). Task 2 constructs the Head-Neck Severity Index (HNSI), a unified metric that normalizes heterogeneous clinical rating scales. The clinical relevance of this index is then evaluated in Task 3, where HNSI is validated against real-world CD patient data, with aligned severe-band proportions (6.7%) providing a preliminary plausibility indication for index calibration within the high severity range. Finally, Task 4 performs bridge analysis between movement-type probabilities and HNSI scores, producing significant correlations (p less than 0.001). These results demonstrate the analytical utility of NeuroPose-AHM as a structured, knowledge-based resource for neurological AHM research. The NeuroPose-AHM dataset is publicly available on Zenodo (https://doi.org/10.5281/zenodo.19386862).

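Summary: Abnormal head movements (AHMs) appear across a broad spectrum of neurological disorders; however, the lack of a multi-condition resource that integrates kinematic measurements, clinical severity scores, and patient demographics is a persistent barrier to developing AI-driven diagnostic tools. To address this gap, this study introduces NeuroPose-AHM, a knowledge-based dataset of neurologically induced AHMs built with a multi-LLM extraction framework applied to 1,430 peer-reviewed publications. The dataset contains 2,756 patient-group-level records spanning 57 neurological conditions, derived from 846 AHM-relevant papers. Inter-LLM reliability analysis confirms robust extraction performance, with study-level classification reaching strong agreement (kappa = 0.822). To demonstrate the dataset's analytical utility, a four-task framework is applied to cervical dystonia (CD), the condition most directly defined by pathological head movement. Task 1 performs multi-label AHM type classification (F1 = 0.856). Task 2 constructs the Head-Neck Severity Index (HNSI), a unified metric that normalizes heterogeneous clinical rating scales. Task 3 evaluates the clinical relevance of this index by validating HNSI against real-world CD patient data, where aligned severe-band proportions (6.7%) give a preliminary plausibility indication for index calibration in the high-severity range. Finally, Task 4 performs a bridge analysis between movement-type probabilities and HNSI scores, producing significant correlations (p less than 0.001). These results demonstrate the analytical utility of NeuroPose-AHM as a structured, knowledge-based resource for neurological AHM research. The NeuroPose-AHM dataset is publicly available on Zenodo (https://doi.org/10.5281/zenodo.19386862).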
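
The entry reports inter-LLM agreement as kappa = 0.822 without saying which kappa statistic was used. Assuming the two-rater Cohen's kappa, a minimal sketch of the computation looks like this; the function name `cohen_kappa` and the toy labels are illustrative, not taken from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    llm_1 = ["relevant", "relevant", "irrelevant", "irrelevant"]
    llm_2 = ["relevant", "irrelevant", "irrelevant", "irrelevant"]
    print(cohen_kappa(llm_1, llm_2))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is lower than the 75% raw agreement in the toy example.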

Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite

2604.01957v1 by Klaudia Thellmann, Bernhard Stadler, Michael Färber

Machine-translated benchmark datasets reduce costs and offer scale, but noise, loss of structure, and uneven quality weaken confidence. What matters is not merely whether we can translate, but also whether we can measure and verify translation reliability at scale. We study translation quality in the EU20 benchmark suite, which comprises five established benchmarks translated into 20 languages, via a three-step automated quality assurance approach: (i) a structural corpus audit with targeted fixes; (ii) quality profiling using a neural metric (COMET, reference-free and reference-based) with translation service comparisons (DeepL / ChatGPT / Google); and (iii) an LLM-based span-level translation error landscape. Trends are consistent: datasets with lower COMET scores exhibit a higher share of accuracy/mistranslation errors at span level (notably HellaSwag; ARC is comparatively clean). Reference-based COMET on MMLU against human-edited samples points in the same direction. We release cleaned/corrected versions of the EU20 datasets, and code for reproducibility. In sum, automated quality assurance offers practical, scalable indicators that help prioritize review -- complementing, not replacing, human gold standards.

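Summary: Machine-translated benchmark datasets reduce costs and offer scale, but noise, loss of structure, and uneven quality weaken confidence. What matters is not merely whether we can translate, but whether we can measure and verify translation reliability at scale. We study translation quality in the EU20 benchmark suite, which comprises five established benchmarks translated into 20 languages, using a three-step automated quality assurance approach: (i) a structural corpus audit with targeted fixes; (ii) quality profiling with a neural metric (COMET, reference-free and reference-based), including translation service comparisons (DeepL / ChatGPT / Google); and (iii) an LLM-based span-level translation error landscape. The trends are consistent: datasets with lower COMET scores show a higher share of accuracy/mistranslation errors at span level (notably HellaSwag; ARC is comparatively clean). Reference-based COMET on MMLU against human-edited samples points in the same direction. We release cleaned/corrected versions of the EU20 datasets, along with code for reproducibility. In sum, automated quality assurance offers practical, scalable indicators that help prioritize review, complementing rather than replacing human gold standards.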

Physics-Informed Transformer for Multi-Band Channel Frequency Response Reconstruction

2604.01944v1 by Anatolij Zubow, Joana Angjo, Sigrid Dimce, Falko Dressler

Wideband channel frequency response (CFR) estimation is challenging in multi-band wireless systems, especially when one or more sub-bands are temporarily blocked by co-channel interference. We present a physics-informed complex Transformer that reconstructs the full wideband CFR from such fragmented, partially observed spectrum snapshots. The interference pattern in each sub-band is modeled as an independent two-state discrete-time Markov chain, capturing realistic bursty occupancy behavior. Our model operates on the joint time-frequency grid of $T$ snapshots and $F$ frequency bins and uses a factored self-attention mechanism that separately attends along both axes, reducing the computational complexity to $O(TF^2 + FT^2)$. Complex-valued inputs and outputs are processed through a holomorphic linear layer that preserves phase relationships. Training uses a composite physics-informed loss combining spectral fidelity, power delay profile (PDP) reconstruction, channel impulse response (CIR) sparsity, and temporal smoothness. Mobility effects are incorporated through per-sample velocity randomization, enabling generalization across different mobility regimes. Evaluation against three classical baselines, namely, last-observation-carry-forward, zero-fill, and cubic-spline interpolation, shows that our approach achieves the highest PDP similarity with respect to the ground truth, reaching $\rho \geq 0.82$ compared to $\rho \geq 0.62$ for the best baseline at interference occupancy levels up to 50%. Furthermore, the model degrades smoothly across the full velocity range, consistently outperforming all other baselines.

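Summary: Wideband channel frequency response (CFR) estimation is challenging in multi-band wireless systems, especially when one or more sub-bands are temporarily blocked by co-channel interference. We present a physics-informed complex Transformer that reconstructs the full wideband CFR from such fragmented, partially observed spectrum snapshots. The interference pattern in each sub-band is modeled as an independent two-state discrete-time Markov chain, capturing realistic bursty occupancy behavior. Our model operates on the joint time-frequency grid of $T$ snapshots and $F$ frequency bins and uses a factored self-attention mechanism that attends separately along each axis, reducing the computational complexity to $O(TF^2 + FT^2)$. Complex-valued inputs and outputs are processed through a holomorphic linear layer that preserves phase relationships. Training uses a composite physics-informed loss combining spectral fidelity, power delay profile (PDP) reconstruction, channel impulse response (CIR) sparsity, and temporal smoothness. Mobility effects are incorporated through per-sample velocity randomization, enabling generalization across mobility regimes. Evaluation against three classical baselines (last-observation-carry-forward, zero-fill, and cubic-spline interpolation) shows that our approach achieves the highest PDP similarity to the ground truth, reaching $\rho \geq 0.82$ compared to $\rho \geq 0.62$ for the best baseline at interference occupancy levels up to 50%. Furthermore, the model degrades smoothly across the full velocity range, consistently outperforming all baselines.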
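
To see where the $O(TF^2 + FT^2)$ figure comes from, it helps to count pairwise attention scores. The sketch below compares joint attention over all $TF$ grid cells with axis-wise attention; it counts score entries only and ignores constants, heads, and embedding width, and the function names and grid sizes are illustrative.

```python
def full_attention_pairs(num_snapshots: int, num_bins: int) -> int:
    """Joint attention over all T*F grid cells: (T*F)^2 score entries."""
    tokens = num_snapshots * num_bins
    return tokens * tokens


def factored_attention_pairs(num_snapshots: int, num_bins: int) -> int:
    """Axis-wise attention.

    Frequency axis: T separate F-by-F score matrices  -> T * F^2
    Time axis:      F separate T-by-T score matrices  -> F * T^2
    """
    return num_snapshots * num_bins**2 + num_bins * num_snapshots**2


if __name__ == "__main__":
    T, F = 10, 64  # illustrative grid size, not taken from the paper
    full = full_attention_pairs(T, F)
    factored = factored_attention_pairs(T, F)
    print(full, factored, full / factored)
```

The ratio between the two counts is $TF / (T + F)$, so the saving grows as the grid gets larger along either axis.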

Captioning Daily Activity Images in Early Childhood Education: Benchmark and Algorithm

2604.01941v1 by Sixing Li, Zhibin Gu, Ziqi Zhang, Weiguo Pan, Bing Li, Ying Wang, Hongzhe Liu

Image captioning for Early Childhood Education (ECE) is essential for automated activity understanding and educational assessment. However, existing methods face two key challenges. First, the lack of large-scale, domain-specific datasets limits the model's ability to capture fine-grained semantic concepts unique to ECE scenarios, resulting in generic and imprecise descriptions. Second, conventional training paradigms exhibit limitations in enhancing professional object description capability, as supervised learning tends to favor high-frequency expressions, while reinforcement learning may suffer from unstable optimization on difficult samples. To address these limitations, we introduce ECAC, a large-scale benchmark for ECE daily activity image captioning, comprising 256,121 real-world images annotated with expert-level captions and fine-grained labels. ECAC is further equipped with a domain-oriented evaluation protocol, the Teaching Toy Recognition Score (TTS), to explicitly measure professional object naming accuracy. Furthermore, we propose RSRS (Reward-Conditional Switch of Reinforcement Learning and Supervised Fine-Tuning), a hybrid training framework that dynamically alternates between RL and supervised optimization. By rerouting hard samples with zero rewards to supervised fine-tuning, RSRS effectively mitigates advantage collapse and enables stable optimization for fine-grained recognition. Leveraging ECAC and RSRS, we develop KinderMM-Cap-3B, a domain-adapted multimodal large language model. Extensive experiments demonstrate that our model achieves a TTS of 51.06, substantially outperforming state-of-the-art baselines while maintaining superior caption quality, highlighting its potential for specialized educational applications.

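Summary: Image captioning for Early Childhood Education (ECE) is essential for automated activity understanding and educational assessment. However, existing methods face two key challenges. First, the lack of large-scale, domain-specific datasets limits a model's ability to capture the fine-grained semantic concepts unique to ECE scenarios, resulting in generic and imprecise descriptions. Second, conventional training paradigms are limited in improving professional object description capability: supervised learning tends to favor high-frequency expressions, while reinforcement learning may suffer from unstable optimization on difficult samples. To address these limitations, we introduce ECAC, a large-scale benchmark for ECE daily activity image captioning, comprising 256,121 real-world images annotated with expert-level captions and fine-grained labels. ECAC also comes with a domain-oriented evaluation protocol, the Teaching Toy Recognition Score (TTS), which explicitly measures professional object naming accuracy. Furthermore, we propose RSRS (Reward-Conditional Switch of Reinforcement Learning and Supervised Fine-Tuning), a hybrid training framework that dynamically alternates between RL and supervised optimization. By rerouting hard samples with zero rewards to supervised fine-tuning, RSRS effectively mitigates advantage collapse and enables stable optimization for fine-grained recognition. Leveraging ECAC and RSRS, we develop KinderMM-Cap-3B, a domain-adapted multimodal large language model. Extensive experiments show that our model achieves a TTS of 51.06, substantially outperforming state-of-the-art baselines while maintaining superior caption quality, highlighting its potential for specialized educational applications.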
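
The reward-conditional switch can be illustrated with a small routing function. This is a sketch under assumptions the abstract does not confirm: rewards are taken to be a list of per-rollout scores for each sample, advantages are assumed to be mean-centered within that group, and the names `centered_advantages` and `route_samples` are hypothetical.

```python
from typing import List, Sequence, Tuple


def centered_advantages(rewards: Sequence[float]) -> List[float]:
    """Assumed advantage form: each reward minus the group mean."""
    mean = sum(rewards) / len(rewards)
    return [reward - mean for reward in rewards]


def route_samples(rewards_per_sample: Sequence[Sequence[float]]) -> Tuple[List[int], List[int]]:
    """Split sample indices into (RL batch, supervised fine-tuning batch).

    A sample whose rollouts all earned zero reward gives all-zero advantages
    under the assumed form, so it carries no RL signal and is rerouted.
    """
    rl_indices: List[int] = []
    sft_indices: List[int] = []
    for index, rewards in enumerate(rewards_per_sample):
        if all(reward == 0 for reward in rewards):
            sft_indices.append(index)
        else:
            rl_indices.append(index)
    return rl_indices, sft_indices


if __name__ == "__main__":
    batch = [[0, 0, 0], [1, 0, 0.5], [0, 0]]
    print(centered_advantages(batch[0]))  # every advantage is zero
    print(route_samples(batch))
```

Under these assumptions, the zero-reward samples are exactly the ones where the RL update would vanish, so sending them to supervised fine-tuning keeps them useful for training.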

Probabilistic classification from possibilistic data: computing Kullback-Leibler projection with a possibility distribution

2604.01939v1 by Ismaïl Baaj, Pierre Marquis

We consider learning with possibilistic supervision for multi-class classification. For each training instance, the supervision is a normalized possibility distribution that expresses graded plausibility over the classes. From this possibility distribution, we construct a non-empty closed convex set of admissible probability distributions by combining two requirements: probabilistic compatibility with the possibility and necessity measures induced by the possibility distribution, and linear shape constraints that must be satisfied to preserve the qualitative structure of the possibility distribution. Thus, classes with the same possibility degree receive equal probabilities, and if a class has a strictly larger possibility degree than another class, then it receives a strictly larger probability. Given a strictly positive probability vector output by a model for an instance, we compute its Kullback-Leibler projection onto the admissible set. This projection yields the closest admissible probability distribution in Kullback-Leibler sense. We can then train the model by minimizing the divergence between the prediction and its projection, which quantifies the smallest adjustment needed to satisfy the induced dominance and shape constraints. The projection is computed with Dykstra's algorithm using Bregman projections associated with the negative entropy, and we provide explicit formulas for the projections onto each constraint set. Experiments conducted on synthetic data and on a real-world natural language inference task, based on the ChaosNLI dataset, show that the proposed projection algorithm is efficient enough for practical use, and that the resulting projection-based learning objective can improve predictive performance.

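Summary: We consider learning with possibilistic supervision for multi-class classification. For each training instance, the supervision is a normalized possibility distribution that expresses graded plausibility over the classes. From this possibility distribution, we construct a non-empty closed convex set of admissible probability distributions by combining two requirements: probabilistic compatibility with the possibility and necessity measures induced by the possibility distribution, and linear shape constraints that must hold to preserve the qualitative structure of the possibility distribution. Thus, classes with the same possibility degree receive equal probabilities, and a class with a strictly larger possibility degree than another receives a strictly larger probability. Given a strictly positive probability vector output by a model for an instance, we compute its Kullback-Leibler projection onto the admissible set. This projection yields the closest admissible probability distribution in the Kullback-Leibler sense. We can then train the model by minimizing the divergence between the prediction and its projection, which quantifies the smallest adjustment needed to satisfy the induced dominance and shape constraints. The projection is computed with Dykstra's algorithm using Bregman projections associated with the negative entropy, and we provide explicit formulas for the projections onto each constraint set. Experiments on synthetic data and on a real-world natural language inference task based on the ChaosNLI dataset show that the proposed projection algorithm is efficient enough for practical use and that the resulting projection-based learning objective can improve predictive performance.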
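
The two admissibility requirements can be checked directly for a small number of classes. The sketch below is not the paper's projection algorithm (no Dykstra iteration is performed); it only tests whether a given probability vector is admissible and computes a KL divergence. It relies on the standard facts that the possibility of a set is the maximum possibility of its elements and that the necessity bound on a set is equivalent to the possibility bound on its complement. Function names and the example vectors are illustrative.

```python
import math
from itertools import combinations
from typing import Sequence


def is_compatible(p: Sequence[float], pi: Sequence[float], tol: float = 1e-12) -> bool:
    """Check P(A) <= Pi(A) = max_{i in A} pi[i] for every non-empty subset A.

    Brute force over subsets, so only suitable for a handful of classes.
    """
    n = len(p)
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            if sum(p[i] for i in subset) > max(pi[i] for i in subset) + tol:
                return False
    return True


def respects_shape(p: Sequence[float], pi: Sequence[float], tol: float = 1e-12) -> bool:
    """Equal possibility -> equal probability; larger possibility -> larger probability."""
    n = len(p)
    for i in range(n):
        for j in range(n):
            if pi[i] == pi[j] and abs(p[i] - p[j]) > tol:
                return False
            if pi[i] > pi[j] and not p[i] > p[j]:
                return False
    return True


def kl_divergence(q: Sequence[float], p: Sequence[float]) -> float:
    """KL(q || p) for a strictly positive p; zero entries of q contribute nothing."""
    return sum(qi * math.log(qi / pj) for qi, pj in zip(q, p) if qi > 0)


if __name__ == "__main__":
    possibility = [1.0, 0.5, 0.2]
    prediction = [0.3, 0.6, 0.1]   # violates P({class 1}) <= 0.5
    candidate = [0.6, 0.3, 0.1]    # admissible under both checks
    print(is_compatible(prediction, possibility), is_compatible(candidate, possibility))
    print(respects_shape(candidate, possibility), kl_divergence(candidate, prediction))
```

The candidate here is just one admissible point picked by hand, not the KL projection of the prediction.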

How to measure the optimality of word or gesture order with respect to the principle of swap distance minimization

2604.01938v1 by Ramon Ferrer-i-Cancho

The structure of all the permutations of a sequence can be represented as a permutohedron, a graph where vertices are permutations and two vertices are linked if a swap of adjacent elements in the permutation of one of the vertices produces the permutation of the other vertex. It has been hypothesized that word orders in languages minimize the swap distance in the permutohedron: given a source order, word orders that are closer in the permutohedron should be less costly and thus more likely. Here we explain how to measure the degree of optimality of word order variation with respect to swap distance minimization. We illustrate the power of our novel mathematical framework by showing that crosslinguistic gestures are at least $77\%$ optimal. It is unlikely that the multiple times where crosslinguistic gestures hit optimality are due to chance. We establish the theoretical foundations for research on the optimality of word or gesture order with respect to swap distance minimization in communication systems. Finally, we introduce the quadratic assignment problem (QAP) into language research as an umbrella for multiple optimization problems and, accordingly, postulate a general principle of optimal assignment that unifies various linguistic principles including swap distance minimization.

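Summary: The structure of all the permutations of a sequence can be represented as a permutohedron, a graph whose vertices are permutations and in which two vertices are linked if swapping adjacent elements in the permutation of one vertex produces the permutation of the other. It has been hypothesized that word orders in languages minimize the swap distance in the permutohedron: given a source order, word orders that are closer in the permutohedron should be less costly and thus more likely. Here we explain how to measure the degree of optimality of word order variation with respect to swap distance minimization. We illustrate the power of our novel mathematical framework by showing that crosslinguistic gestures are at least $77\%$ optimal. The many cases in which crosslinguistic gestures hit optimality are unlikely to be due to chance. We establish the theoretical foundations for research on the optimality of word or gesture order with respect to swap distance minimization in communication systems. Finally, we introduce the quadratic assignment problem (QAP) into language research as an umbrella for multiple optimization problems and, accordingly, postulate a general principle of optimal assignment that unifies various linguistic principles, including swap distance minimization.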
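
Swap distance in the permutohedron, the minimum number of adjacent swaps that turns one order into another, equals the number of inversions between the two orders (the Kendall tau distance). A minimal sketch follows; it computes the distance only, not the paper's optimality score, and the subject/object/verb example is illustrative.

```python
from typing import Hashable, Sequence


def swap_distance(source: Sequence[Hashable], target: Sequence[Hashable]) -> int:
    """Minimum number of adjacent swaps turning `source` into `target`.

    Both sequences must contain the same distinct elements.
    """
    if sorted(map(str, source)) != sorted(map(str, target)):
        raise ValueError("source and target must be permutations of each other")
    position = {item: index for index, item in enumerate(target)}
    ranks = [position[item] for item in source]
    # Count pairs that appear in opposite relative order (inversions).
    return sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[i] > ranks[j]
    )


if __name__ == "__main__":
    for order in ("SOV", "SVO", "OSV", "VSO", "OVS", "VOS"):
        print(order, swap_distance("SOV", order))
```

For $n$ elements the largest possible distance is $n(n-1)/2$, reached by the fully reversed order, which for three elements is 3.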

Reliable News or Propagandist News? A Neurosymbolic Model Using Genre, Topic, and Persuasion Techniques to Improve Robustness in Classification

2604.01936v1 by Géraud Faye, Benjamin Icard, Morgane Casanova, Guillaume Gadek, Guillaume Gravier, Wassila Ouerdane, Céline Hudelot, Sylvain Gatepaille, Paul Égré

Among news disorders, propagandist news is particularly insidious, because it tends to mix oriented messages with factual reports intended to look like reliable news. To detect propaganda, extant approaches based on Language Models such as BERT are promising but often overfit their training datasets, due to biases in data collection. To enhance classification robustness and improve generalization to new sources, we propose a neurosymbolic approach combining non-contextual text embeddings (fastText) with symbolic conceptual features such as genre, topic, and persuasion techniques. Results show improvements over equivalent text-only methods, and ablation studies as well as explainability analyses confirm the benefits of the added features. Keywords: Information disorder, Fake news, Propaganda, Classification, Topic modeling, Hybrid method, Neurosymbolic model, Ablation, Robustness

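Summary: Among news disorders, propagandist news is particularly insidious, because it tends to mix oriented messages with factual reports intended to look like reliable news. Existing propaganda detection approaches based on language models such as BERT are promising but often overfit their training datasets because of biases in data collection. To make classification more robust and improve generalization to new sources, we propose a neurosymbolic approach that combines non-contextual text embeddings (fastText) with symbolic conceptual features such as genre, topic, and persuasion techniques. Results show improvements over equivalent text-only methods, and ablation studies as well as explainability analyses confirm the benefits of the added features.
Keywords: Information disorder, Fake news, Propaganda, Classification, Topic modeling, Hybrid method, Neurosymbolic model, Ablation, Robustness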

Quantum-Inspired Geometric Classification with Correlation Group Structures and VQC Decision Modeling

2604.01930v1 by Nishikanta Mohanty, Arya Ansuman Priyadarshi, Bikash K. Behera, Badshah Mukherjee

We propose a geometry-driven quantum-inspired classification framework that integrates Correlation Group Structures (CGR), compact SWAP-test-based overlap estimation, and selective variational quantum decision modelling. Rather than directly approximating class posteriors, the method adopts a geometry-first paradigm in which samples are evaluated relative to class medoids using overlap-derived Euclidean-like and angular similarity channels. CGR organizes features into anchor-centered correlation neighbourhoods, generating nonlinear, correlation-weighted representations that enhance robustness in heterogeneous tabular spaces. These geometric signals are fused through a non-probabilistic margin-based fusion score, serving as a lightweight and data-efficient primary classifier for small-to-moderate datasets. On Heart Disease, Breast Cancer, and Wine Quality datasets, the fusion-score classifier achieves 0.8478, 0.8881, and 0.9556 test accuracy respectively, with macro-F1 scores of 0.8463, 0.8703, and 0.9522, demonstrating competitive and stable performance relative to classical baselines. For large-scale and highly imbalanced regimes, we construct compact Delta-distance contrastive features and train a variational quantum classifier (VQC) as a nonlinear refinement layer. On the Credit Card Fraud dataset (0.17% prevalence), the Delta + VQC pipeline achieves approximately 0.85 minority recall at an alert rate of approximately 1.31%, with ROC-AUC 0.9249 and PR-AUC 0.3251 under full-dataset evaluation. These results highlight the importance of operating-point-aware assessment in rare-event detection and demonstrate that the proposed hybrid geometric-variational framework provides interpretable, scalable, and regime-adaptive classification across heterogeneous data settings.

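Summary: We propose a geometry-driven, quantum-inspired classification framework that integrates Correlation Group Structures (CGR), compact SWAP-test-based overlap estimation, and selective variational quantum decision modelling. Rather than directly approximating class posteriors, the method adopts a geometry-first paradigm in which samples are evaluated relative to class medoids using overlap-derived Euclidean-like and angular similarity channels. CGR organizes features into anchor-centered correlation neighbourhoods, generating nonlinear, correlation-weighted representations that improve robustness in heterogeneous tabular spaces. These geometric signals are fused through a non-probabilistic margin-based fusion score, which serves as a lightweight, data-efficient primary classifier for small-to-moderate datasets. On the Heart Disease, Breast Cancer, and Wine Quality datasets, the fusion-score classifier achieves test accuracies of 0.8478, 0.8881, and 0.9556, respectively, with macro-F1 scores of 0.8463, 0.8703, and 0.9522, showing competitive and stable performance relative to classical baselines. For large-scale and highly imbalanced regimes, we construct compact Delta-distance contrastive features and train a variational quantum classifier (VQC) as a nonlinear refinement layer. On the Credit Card Fraud dataset (0.17% prevalence), the Delta + VQC pipeline achieves approximately 0.85 minority recall at an alert rate of approximately 1.31%, with ROC-AUC 0.9249 and PR-AUC 0.3251 under full-dataset evaluation. These results highlight the importance of operating-point-aware assessment in rare-event detection and show that the proposed hybrid geometric-variational framework provides interpretable, scalable, and regime-adaptive classification across heterogeneous data settings.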
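
The SWAP test mentioned in this entry estimates the overlap between two quantum states: the ancilla is measured as 0 with probability $\tfrac{1}{2} + \tfrac{1}{2}|\langle a|b\rangle|^2$. The sketch below emulates that relation classically for real-valued feature vectors. It does not reproduce the paper's similarity channels or fusion score, and the function names are illustrative.

```python
import math
from typing import Sequence


def overlap_squared(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared inner product of the two vectors after normalizing each to unit length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    inner = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    return inner * inner


def swap_test_p0(a: Sequence[float], b: Sequence[float]) -> float:
    """Ideal probability of measuring the ancilla in state 0."""
    return 0.5 + 0.5 * overlap_squared(a, b)


def overlap_from_p0(p0: float) -> float:
    """Invert the relation: recover the squared overlap from a measured p0."""
    return 2.0 * p0 - 1.0


if __name__ == "__main__":
    sample = [1.0, 1.0]
    medoid = [1.0, 0.0]  # hypothetical class medoid
    p0 = swap_test_p0(sample, medoid)
    print(p0, overlap_from_p0(p0))
```

Identical states give a probability of 1 and orthogonal states give 0.5, so on hardware the squared overlap would be estimated from the observed frequency of 0 outcomes.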

Woosh: A Sound Effects Foundation Model

2604.01929v1 by Gaëtan Hadjeres, Marc Ferras, Khaled Koutini, Benno Weck, Alexandre Bittar, Thomas Hummel, Zineb Lahrici, Hakim Missoum, Joan Serrà, Yuki Mitsufuji

The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI's publicly released sound effect foundation model, detailing its architecture, training process, and an evaluation against other popular open models. Optimized for sound effects, the release provides (1) a high-quality audio encoder/decoder model and (2) a text-audio alignment model for conditioning, together with (3) text-to-audio and (4) video-to-audio generative models. Distilled text-to-audio and video-to-audio models are also included in the release, allowing for low-resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module when compared to existing open alternatives like StableAudio-Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.

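Summary: The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI's publicly released sound effect foundation model, detailing its architecture, training process, and an evaluation against other popular open models. The release is optimized for sound effects and provides (1) a high-quality audio encoder/decoder model and (2) a text-audio alignment model for conditioning, together with (3) text-to-audio and (4) video-to-audio generative models. Distilled text-to-audio and video-to-audio models are also included, allowing low-resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module compared to existing open alternatives such as StableAudio-Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.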

ImplicitBBQ: Benchmarking Implicit Bias in Large Language Models through Characteristic Based Cues

2604.01925v1 by Bhaskara Hanuma Vedula, Darshan Anghan, Ishita Goyal, Ponnurangam Kumaraguru, Abhijnan Chakraborty

Large Language Models increasingly suppress biased outputs when demographic identity is stated explicitly, yet may still exhibit implicit biases when identity is conveyed indirectly. Existing benchmarks use name based proxies to detect implicit biases, which carry weak associations with many social demographics and cannot extend to dimensions like age or socioeconomic status. We introduce ImplicitBBQ, a QA benchmark that evaluates implicit bias through characteristic based cues, culturally associated attributes that signal implicitly, across age, gender, region, religion, caste, and socioeconomic status. Evaluating 11 models, we find that implicit bias in ambiguous contexts is over six times higher than explicit bias in open weight models. Safety prompting and chain-of-thought reasoning fail to substantially close this gap; even few-shot prompting, which reduces implicit bias by 84%, leaves caste bias at four times the level of any other dimension. These findings indicate that current alignment and prompting strategies address the surface of bias evaluation while leaving culturally grounded stereotypic associations largely unresolved. We publicly release our code and dataset for model providers and researchers to benchmark potential mitigation techniques.

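Summary: Large language models increasingly suppress biased outputs when demographic identity is stated explicitly, yet may still exhibit implicit biases when identity is conveyed indirectly. Existing benchmarks use name-based proxies to detect implicit biases, but names carry only weak associations with many social demographics and cannot extend to dimensions such as age or socioeconomic status. We introduce ImplicitBBQ, a QA benchmark that evaluates implicit bias through characteristic-based cues (culturally associated attributes that signal identity implicitly) across age, gender, region, religion, caste, and socioeconomic status. Evaluating 11 models, we find that implicit bias in ambiguous contexts is over six times higher than explicit bias in open-weight models. Safety prompting and chain-of-thought reasoning fail to substantially close this gap; even few-shot prompting, which reduces implicit bias by 84%, leaves caste bias at four times the level of any other dimension. These findings indicate that current alignment and prompting strategies address the surface of bias evaluation while leaving culturally grounded stereotypic associations largely unresolved. We publicly release our code and dataset so that model providers and researchers can benchmark potential mitigation techniques.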

Is Clinical Text Enough? A Multimodal Study on Mortality Prediction in Heart Failure Patients

2604.01924v1 by Oumaima El Khettari, Virgile Barthet, Guillaume Hocquet, Joconde Weller, Emmanuel Morin, Pierre Zweigenbaum

Accurate short-term mortality prediction in heart failure (HF) remains challenging, particularly when relying on structured electronic health record (EHR) data alone. We evaluate transformer-based models on a French HF cohort, comparing text-only, structured-only, multimodal, and LLM-based approaches. Our results show that enriching clinical text with entity-level representations improves prediction over CLS embeddings alone, and that supervised multimodal fusion of text and structured variables achieves the best overall performance. In contrast, large language models perform inconsistently across modalities and decoding strategies, with text-only prompts outperforming structured or multimodal inputs. These findings highlight that entity-aware multimodal transformers offer the most reliable solution for short-term HF outcome prediction, while current LLM prompting remains limited for clinical decision support.

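Summary: Accurate short-term mortality prediction in heart failure (HF) remains challenging, particularly when relying on structured electronic health record (EHR) data alone. We evaluate transformer-based models on a French HF cohort, comparing text-only, structured-only, multimodal, and LLM-based approaches. Our results show that enriching clinical text with entity-level representations improves prediction over CLS embeddings alone, and that supervised multimodal fusion of text and structured variables achieves the best overall performance. In contrast, large language models perform inconsistently across modalities and decoding strategies, with text-only prompts outperforming structured or multimodal inputs. These findings highlight that entity-aware multimodal transformers offer the most reliable solution for short-term HF outcome prediction, while current LLM prompting remains limited for clinical decision support.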

SURE: Synergistic Uncertainty-aware Reasoning for Multimodal Emotion Recognition in Conversations

2604.01916v1 by Yiqiang Cai, Chengyan Wu, Bolei Ma, Bo Chen, Yun Xue, Julia Hirschberg, Ziwei Gong

Multimodal emotion recognition in conversations (MERC) requires integrating multimodal signals while being robust to noise and modeling contextual reasoning. Existing approaches often emphasize fusion but overlook uncertainty in noisy features and fine-grained reasoning. We propose SURE (Synergistic Uncertainty-aware REasoning) for MERC, a framework that improves robustness and contextual modeling. SURE consists of three components: an Uncertainty-Aware Mixture-of-Experts module to handle modality-specific noise, an Iterative Reasoning module for multi-turn reasoning over context, and a Transformer Gate module to capture intra- and inter-modal interactions. Experiments on benchmark MERC datasets show that SURE consistently outperforms state-of-the-art methods, demonstrating its effectiveness in robust multimodal reasoning. These results highlight the importance of uncertainty modeling and iterative reasoning in advancing emotion recognition in conversational settings.

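Summary: Multimodal emotion recognition in conversations (MERC) requires integrating multimodal signals while staying robust to noise and modeling contextual reasoning. Existing approaches often emphasize fusion but overlook uncertainty in noisy features and fine-grained reasoning. We propose SURE (Synergistic Uncertainty-aware REasoning) for MERC, a framework that improves robustness and contextual modeling. SURE consists of three components: an Uncertainty-Aware Mixture-of-Experts module to handle modality-specific noise, an Iterative Reasoning module for multi-turn reasoning over context, and a Transformer Gate module to capture intra- and inter-modal interactions. Experiments on benchmark MERC datasets show that SURE consistently outperforms state-of-the-art methods, demonstrating its effectiveness in robust multimodal reasoning. These results highlight the importance of uncertainty modeling and iterative reasoning in advancing emotion recognition in conversational settings.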

Lifting Unlabeled Internet-level Data for 3D Scene Understanding

2604.01907v1 by Yixin Chen, Yaowei Zhang, Huangyue Yu, Junchao He, Yan Wang, Jiangyong Huang, Hongyu Shen, Junfeng Ni, Shaofei Wang, Baoxiong Jia, Song-Chun Zhu, Siyuan Huang

Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data, to facilitate end-to-end models in 3D scene understanding alongside human-annotated datasets. We identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks ranging from low-level perception, i.e., 3D object detection and instance segmentation, to high-level reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Language Navigation (VLN). Models trained on our generated data demonstrate strong zero-shot performance and show further improvement after finetuning. This demonstrates the viability of leveraging readily available web data as a path toward more capable scene understanding systems.

摘要:註解的 3D 場景數據稀缺且獲取成本高昂,而網路上卻有大量未標記的視頻可供使用。在本文中,我們展示了精心設計的數據引擎如何利用網路策劃的未標記視頻自動生成訓練數據,以促進端到端模型在 3D 場景理解中的應用,並與人類標註的數據集相輔相成。我們識別並分析了自動數據生成中的瓶頸,揭示了影響從未標記數據學習的效率和效果的關鍵因素。為了在不同的感知粒度上驗證我們的方法,我們在三個任務上進行評估,這些任務涵蓋了低級感知,即 3D 物體檢測和實例分割,以及高級推理,即 3D 空間視覺問答 (VQA) 和視覺-語言導航 (VLN)。在我們生成的數據上訓練的模型展示了強大的零樣本性能,並在微調後顯示出進一步的改進。這證明了利用現成的網路數據作為更強大場景理解系統的途徑的可行性。

Combating Data Laundering in LLM Training

2604.01904v1 by Muxing Li, Zesheng Ye, Sharon Li, Feng Liu

Data rights owners can detect unauthorized data use in large language model (LLM) training by querying with proprietary samples. Often, superior performance (e.g., higher confidence or lower loss) on a sample relative to the untrained data implies it was part of the training corpus, as LLMs tend to perform better on data they have seen during training. However, this detection becomes fragile under data laundering, a practice of transforming the stylistic form of proprietary data, while preserving critical information to obfuscate data provenance. When an LLM is trained exclusively on such laundered variants, it no longer performs better on originals, erasing the signals that standard detections rely on. We counter this by inferring the unknown laundering transformation from black-box access to the target LLM and, via an auxiliary LLM, synthesizing queries that mimic the laundered data, even if rights owners have only the originals. As the search space of finding true laundering transformations is infinite, we abstract such a process into a high-level transformation goal (e.g., "lyrical rewriting") and concrete details (e.g., "with vivid imagery"), and introduce synthesis data reversion (SDR) that instantiates this abstraction. SDR first identifies the most probable goal for synthesis to narrow the search; it then iteratively refines details so that synthesized queries gradually elicit stronger detection signals from the target LLM. Evaluated on the MIMIR benchmark against diverse laundering practices and target LLM families (Pythia, Llama2, and Falcon), SDR consistently strengthens data misuse detection, providing a practical countermeasure to data laundering.

摘要:資料權利擁有者可以透過使用專有樣本查詢,檢測大型語言模型(LLM)訓練中的未經授權數據使用。通常,相對於未訓練數據,對某個樣本的優越表現(例如,更高的信心或更低的損失)意味著它是訓練語料庫的一部分,因為LLM在訓練期間見過的數據上表現通常更好。然而,這種檢測在數據洗白的情況下變得脆弱,數據洗白是一種轉變專有數據的風格形式的做法,同時保留關鍵信息以模糊數據來源。當LLM僅在這些洗白變體上訓練時,它在原始數據上的表現不再優於洗白數據,抹去了標準檢測所依賴的信號。我們通過從對目標LLM的黑箱訪問中推斷未知的洗白轉換,並通過輔助LLM合成模仿洗白數據的查詢來應對,即使權利擁有者只有原始數據。由於尋找真實洗白轉換的搜索空間是無限的,我們將這一過程抽象為高層次的轉換目標(例如,“抒情重寫”)和具體細節(例如,“以生動的意象”),並引入合成數據反轉(SDR)來具體化這一抽象。SDR首先識別最可能的合成目標以縮小搜索範圍;然後,它迭代地細化細節,使合成查詢逐漸引發目標LLM更強的檢測信號。在MIMIR基準上針對多樣的洗白做法和目標LLM系列(Pythia、Llama2和Falcon)進行評估,SDR始終增強數據濫用檢測,提供了一種對抗數據洗白的實用對策。

Bayesian Elicitation with LLMs: Model Size Helps, Extra "Reasoning" Doesn't Always

2604.01896v1 by Luka Hobor, Mario Brcic, Mihael Kovac, Kristijan Poje

Large language models (LLMs) have been proposed as alternatives to human experts for estimating unknown quantities with associated uncertainty, a process known as Bayesian elicitation. We test this by asking eleven LLMs to estimate population statistics, such as health prevalence rates, personality trait distributions, and labor market figures, and to express their uncertainty as 95% credible intervals. We vary each model's reasoning effort (low, medium, high) to test whether more "thinking" improves results. Our findings reveal three key results. First, larger, more capable models produce more accurate estimates, but increasing reasoning effort provides no consistent benefit. Second, all models are severely overconfident: their 95% intervals contain the true value only 9--44% of the time, far below the expected 95%. Third, a statistical recalibration technique called conformal prediction can correct this overconfidence, expanding the intervals to achieve the intended coverage. In a preliminary experiment, giving models web search access degraded predictions for already-accurate models, while modestly improving predictions for weaker ones. Models performed well on commonly discussed topics but struggled with specialized health data. These results indicate that LLM uncertainty estimates require statistical correction before they can be used in decision-making.

摘要:大型語言模型(LLMs)被提出作為人類專家在估計與不確定性相關的未知數量的替代方案,這個過程被稱為貝葉斯引導。我們通過要求十一個LLM估計人口統計數據,例如健康流行率、個性特徵分佈和勞動市場數據,並將其不確定性表達為95%可信區間,來測試這一點。我們變化每個模型的推理努力(低、中、高)以測試更多的“思考”是否能改善結果。我們的研究結果揭示了三個關鍵結果。首先,較大、能力更強的模型產生更準確的估計,但增加推理努力並未提供一致的好處。其次,所有模型都過於自信:它們的95%區間僅在9--44%的情況下包含真實值,遠低於預期的95%。第三,一種稱為符合預測的統計重新校準技術可以糾正這種過度自信,擴大區間以實現預期的覆蓋率。在一個初步實驗中,給模型提供網絡搜索訪問權限使得已經準確的模型的預測變差,而對較弱的模型則有適度的改善。模型在常見話題上表現良好,但在專門的健康數據上則掙扎。這些結果表明,LLM的不確定性估計在用於決策之前需要進行統計校正。
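The recalibration step mentioned in this abstract can be illustrated with a minimal split-conformal sketch. The abstract does not say which conformal variant the authors use, so the interval-widening score below (in the style of conformalized quantile regression), the function names, and the synthetic data are all assumptions made for illustration.

```python
import math
import numpy as np

def conformal_margin(lo, hi, y, alpha=0.05):
    """Margin to add on both sides of each interval (assumed CQR-style score)."""
    lo, hi, y = (np.asarray(v, dtype=float) for v in (lo, hi, y))
    # Score is positive when the true value falls outside the stated interval.
    scores = np.maximum(lo - y, y - hi)
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return float(np.sort(scores)[k - 1])

def coverage(lo, hi, y):
    """Fraction of true values that fall inside their interval."""
    return float(np.mean((lo <= y) & (y <= hi)))

# Synthetic stand-in for overconfident elicited intervals (not real LLM output).
rng = np.random.default_rng(0)
truth = rng.normal(0.0, 1.0, size=2000)
lo = np.full(2000, -0.3)
hi = np.full(2000, 0.3)

cal, test = slice(0, 1000), slice(1000, 2000)
margin = conformal_margin(lo[cal], hi[cal], truth[cal], alpha=0.05)
raw_cov = coverage(lo[test], hi[test], truth[test])
new_cov = coverage(lo[test] - margin, hi[test] + margin, truth[test])
print(f"coverage before: {raw_cov:.2f}, after: {new_cov:.2f}")
```

The margin is learned on a held-out calibration split and then applied unchanged to new intervals, which is what lets the widened intervals approach the nominal coverage without retraining the model.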

HieraVid: Hierarchical Token Pruning for Fast Video Large Language Models

2604.01881v1 by Yansong Guo, Chaoyang Zhu, Jiayi Ji, Jianghang Lin, Liujuan Cao

Video Large Language Models (VideoLLMs) have demonstrated impressive capabilities in video understanding, yet the massive number of input video tokens incurs a significant computational burden for deployment. Existing methods mainly prune video tokens at the input level while neglecting the inherent information structure embedded in videos and large language models (LLMs). To address this, we propose HieraVid, a hierarchical pruning framework that progressively and dynamically reduces visual redundancy. Based on two observations, that videos possess a segment-frame structure and that LLMs internally propagate multi-modal information unidirectionally, we decompose pruning into three levels: 1) segment-level, where video tokens are first temporally segmented and spatially merged; 2) frame-level, where similar frames within the same segment are jointly pruned to preserve diversity; 3) layer-level, where redundancy is gradually reduced as LLM layer depth increases, without compromising performance. We conduct extensive experiments on four widely used video understanding benchmarks to comprehensively evaluate the effectiveness of HieraVid. Remarkably, with only 30% of tokens retained, HieraVid achieves new state-of-the-art performance, while maintaining over 98% and 99% of the performance of LLaVA-Video-7B and LLaVA-OneVision-7B, respectively.

摘要:視頻大型語言模型(VideoLLMs)在視頻理解方面展示了令人印象深刻的能力,但大量的輸入視頻標記對於部署造成了顯著的計算負擔。現有的方法主要在輸入層面修剪視頻標記,卻忽略了視頻和大型語言模型(LLMs)中固有的信息結構。為了解決這一問題,我們提出了HieraVid,一個層次化的修剪框架,逐步且動態地減少視覺冗餘。基於兩個觀察,即視頻具有段-幀結構,並且LLMs內部單向傳播多模態信息,我們將修剪分解為三個層次:1)段級,首先對視頻標記進行時間分段和空間合併;2)幀級,在同一段內共同修剪相似的幀以保持多樣性;3)層級,隨著LLM層數的增加,冗餘逐漸減少而不妨礙性能。我們在四個廣泛使用的視頻理解基準上進行了廣泛的實驗,以全面評估HieraVid的有效性。值得注意的是,僅保留30%的標記,HieraVid便達到了新的最先進性能,同時分別保持了LLaVA-Video-7B和LLaVA-OneVision-7B超過98%和99%的性能。

Beyond Detection: Ethical Foundations for Automated Dyslexic Error Attribution

2604.01853v1 by Samuel Rose, Debarati Chakraborty

Dyslexic spelling errors exhibit systematic phonological and orthographic patterns that distinguish them from the errors produced by typically developing writers. While this observation has motivated dyslexic-specific spell-checking and assistive writing tools, prior work has focused predominantly on error correction rather than attribution, and has largely neglected the ethical risks. The risk of harmful labelling, covert screening, algorithmic bias, and institutional misuse that automated classification of learners entails requires the development of robust ethical and legal frameworks for research in this area. This paper addresses both gaps. We formulate dyslexic error attribution as a binary classification task: given a misspelt word and its correct target form, determine whether the error pattern is characteristic of a dyslexic or non-dyslexic writer. We develop a comprehensive feature set capturing orthographic, phonological, and morphological properties of each error, and propose a twin-input neural model evaluated against traditional machine learning baselines under writer-independent conditions. The neural model achieves 93.01% accuracy and an F1-score of 94.01%, with phonetically plausible errors and vowel confusions emerging as the strongest attribution signals. We situate these technical results within an explicit ethics-first framework, analysing fairness across subgroups, the interpretability requirements of educational deployment, and the conditions (consent, transparency, human oversight, and recourse) under which a system could be responsibly used. We provide concrete guidelines for ethical deployment and an open discussion of the system's limitations and misuse potential. Our results demonstrate that dyslexic error attribution is feasible at high accuracy while underscoring that feasibility alone is insufficient for deployment in high-stakes educational contexts.

摘要:失讀症的拼寫錯誤展現出系統性的語音和正字法模式,使其與典型發展的寫作者所產生的錯誤區分開來。雖然這一觀察促使了針對失讀症的特定拼寫檢查和輔助寫作工具的開發,但先前的工作主要集中在錯誤修正而非歸因,並且在很大程度上忽視了倫理風險。自動化學習者分類所帶來的有害標籤、隱性篩選、算法偏見和機構濫用的風險,需要為該領域的研究制定健全的倫理和法律框架。本文針對這兩個空白進行探討。我們將失讀症錯誤歸因表述為一個二元分類任務。給定一個拼寫錯誤的單詞及其正確的目標形式,判斷該錯誤模式是否是失讀症或非失讀症寫作者的特徵。我們開發了一套全面的特徵集,捕捉每個錯誤的正字法、語音學和形態學特性,並提出了一種雙輸入神經模型,該模型在獨立於寫作者的條件下,與傳統機器學習基準進行評估。該神經模型達到了93.01%的準確率和94.01%的F1-score,語音上合理的錯誤和元音混淆成為最強的歸因信號。我們將這些技術結果置於明確的倫理優先框架內,分析不同子群體之間的公平性、教育部署的可解釋性要求,以及系統在何種條件、同意、透明度、人類監督和救濟下可以負責任地使用。我們提供了具體的倫理部署指南以及對系統限制和濫用潛力的公開討論。我們的結果表明,失讀症錯誤歸因在高準確率下是可行的,同時強調僅有可行性不足以在高風險的教育環境中進行部署。

From Guessing to Placeholding: A Cost-Theoretic Framework for Uncertainty-Aware Code Completion

2604.01849v1 by Liang Zhu, Haolin Chen, Lidong Zhao, Xian Wu

While Large Language Models (LLMs) have demonstrated exceptional proficiency in code completion, they typically adhere to a Hard Completion (HC) paradigm, compelling the generation of fully concrete code even amidst insufficient context. Our analysis of 3 million real-world interactions exposes the limitations of this strategy: 61% of the generated suggestions were either edited after acceptance or rejected despite exhibiting over 80% similarity to the user's subsequent code, suggesting that models frequently make erroneous predictions at specific token positions. Motivated by this observation, we propose Adaptive Placeholder Completion (APC), a collaborative framework that extends HC by strategically outputting explicit placeholders at high-entropy positions, allowing users to fill directly via IDE navigation. Theoretically, we formulate code completion as a cost-minimization problem under uncertainty. Premised on the observation that filling placeholders incurs lower cost than correcting errors, we prove the existence of a critical entropy threshold above which APC achieves strictly lower expected cost than HC. We instantiate this framework by constructing training data from filtered real-world edit logs and design a cost-based reward function for reinforcement learning. Extensive evaluations across 1.5B--14B parameter models demonstrate that APC reduces expected editing costs by 19% to 50% while preserving standard HC performance. Our work provides both a theoretical foundation and a practical training framework for uncertainty-aware code completion, demonstrating that adaptive abstention can be learned end-to-end without sacrificing conventional completion quality.

摘要:大型語言模型(LLMs)在代碼補全方面展現出卓越的能力,但它們通常遵循硬性補全(HC)範式,強迫生成完全具體的代碼,即使在上下文不足的情況下也是如此。對 300 萬次真實互動的分析揭示了這一策略的局限性:61% 的生成建議在被接受後被編輯或被拒絕,儘管它們與用戶隨後的代碼有超過 80% 的相似性,這表明模型在特定的標記位置經常做出錯誤的預測。受到這一觀察的啟發,我們提出了自適應佔位符補全(APC),這是一個協作框架,通過在高熵位置戰略性地輸出明確的佔位符來擴展 HC,允許用戶通過 IDE 導航直接填寫。從理論上講,我們將代碼補全形式化為一個不確定性下的成本最小化問題。基於填寫佔位符的成本低於糾正錯誤的觀察,我們證明了存在一個關鍵熵閾值,超過該閾值時,APC 的期望成本嚴格低於 HC。我們通過從過濾的真實編輯日誌中構建訓練數據來實現這一框架,並為強化學習設計了一個基於成本的獎勵函數。在 15 億到 140 億參數模型上的廣泛評估表明,APC 將期望編輯成本降低了 19% 至 50%,同時保持標準 HC 的性能。我們的工作為不確定性感知的代碼補全提供了理論基礎和實用的訓練框架,展示了自適應放棄可以端到端學習,而不犧牲傳統補全質量。
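The cost argument in this abstract can be made concrete with a toy decision rule. This is a simplified sketch, not the paper's formulation: it uses the top-token error probability as a stand-in for the entropy criterion, and the cost values `c_fix` and `c_fill`, the `<FILL>` marker, and all function names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def expected_costs(probs, c_fix=1.0, c_fill=0.4):
    """Toy cost model: a wrong hard completion costs c_fix, a placeholder costs c_fill."""
    p_err = 1.0 - max(probs)  # chance the top token is wrong, taking probs at face value
    return {"hard": p_err * c_fix, "placeholder": c_fill}

def complete(positions, c_fix=1.0, c_fill=0.4, placeholder="<FILL>"):
    """Emit the top token or a placeholder at each position, whichever is cheaper in expectation."""
    out = []
    for tokens, probs in positions:
        costs = expected_costs(probs, c_fix, c_fill)
        if costs["placeholder"] < costs["hard"]:
            out.append(placeholder)
        else:
            out.append(tokens[probs.index(max(probs))])
    return out

positions = [
    (["return"], [1.0]),                # certain: keep the hard completion
    (["x", "y", "z"], [0.4, 0.3, 0.3]), # uncertain: the placeholder is cheaper
    (["+", "-"], [0.9, 0.1]),           # fairly certain: keep the hard completion
]
print(complete(positions))  # → ['return', '<FILL>', '+']
```

Under this toy model the placeholder wins whenever the error probability exceeds `c_fill / c_fix`, which mirrors the abstract's claim that a threshold exists once filling is cheaper than correcting.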

CANDI: Curated Test-Time Adaptation for Multivariate Time-Series Anomaly Detection Under Distribution Shift

2604.01845v1 by HyunGi Kim, Jisoo Mok, Hyungyu Lee, Juhyeon Shin, Sungroh Yoon

Multivariate time-series anomaly detection (MTSAD) aims to identify deviations from normality in multivariate time-series and is critical in real-world applications. However, in real-world deployments, distribution shifts are ubiquitous and cause severe performance degradation in pre-trained anomaly detectors. Test-time adaptation (TTA) updates a pre-trained model on-the-fly using only unlabeled test data, making it promising for addressing this challenge. In this study, we propose CANDI (Curated test-time adaptation for multivariate time-series ANomaly detection under DIstribution shift), a novel TTA framework that selectively identifies and adapts to potential false positives while preserving pre-trained knowledge. CANDI introduces a False Positive Mining (FPM) strategy to curate adaptation samples based on anomaly scores and latent similarity, and incorporates a plug-and-play Spatiotemporally-Aware Normality Adaptation (SANA) module for structurally informed model updates. Extensive experiments demonstrate that CANDI significantly improves the performance of MTSAD under distribution shift, improving AUROC up to 14% while using fewer adaptation samples.

摘要:多變量時間序列異常檢測(MTSAD)旨在識別多變量時間序列中的正常性偏差,並在現實應用中至關重要。然而,在現實部署中,分佈變化無處不在,並導致預訓練異常檢測器的性能嚴重下降。測試時適應(TTA)僅使用未標記的測試數據即時更新預訓練模型,使其在應對這一挑戰時顯得前景可期。在本研究中,我們提出了CANDI(針對分佈變化的多變量時間序列異常檢測的精選測試時適應),這是一個新穎的TTA框架,能夠選擇性地識別和適應潛在的假陽性,同時保留預訓練知識。CANDI引入了一種假陽性挖掘(FPM)策略,根據異常分數和潛在相似性來篩選適應樣本,並結合了一個即插即用的時空感知正常性適應(SANA)模塊,以進行結構性知識更新。大量實驗表明,CANDI在分佈變化下顯著提高了MTSAD的性能,AUROC提高了多達14%,同時使用了更少的適應樣本。

Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints

2604.01841v1 by Minh-Khoi Pham, Thang-Long Nguyen Ho, Thao Thi Phuong Dao, Tai Tan Mai, Minh-Triet Tran, Marie E. Ward, Una Geary, Rob Brennan, Nick McDonald, Martin Crane, Marija Bezbradica

Clinical prediction from structured electronic health records (EHRs) is challenging due to high dimensionality, heterogeneity, class imbalance, and distribution shift. While tabular in-context learning (TICL) and retrieval-augmented methods perform well on generic benchmarks, their behavior in clinical settings remains unclear. We present a multi-cohort EHR benchmark comparing classical, deep tabular, and TICL models across varying data scale, feature dimensionality, outcome rarity, and cross-cohort generalization. PFN-based TICL models are sample-efficient in low-data regimes but degrade under naive distance-based retrieval as heterogeneity and imbalance increase. We propose AWARE, a task-aligned retrieval framework using supervised embedding learning and lightweight adapters. AWARE improves AUPRC by up to 12.2% under extreme imbalance, with gains increasing with data complexity. Our results identify retrieval quality and retrieval-inference alignment as key bottlenecks for deploying tabular in-context learning in clinical prediction.

摘要:臨床預測來自結構化電子健康紀錄(EHRs)是具有挑戰性的,因為它們具有高維度性、異質性、類別不平衡和分佈轉移。雖然表格內文學習(TICL)和檢索增強方法在通用基準上表現良好,但它們在臨床環境中的行為仍不明確。我們提出了一個多隊列EHR基準,比較了傳統模型、深度表格模型和TICL模型在不同數據規模、特徵維度、結果稀有性和跨隊列泛化方面的表現。基於PFN的TICL模型在低數據環境中樣本效率高,但在異質性和不平衡性增加時,簡單的基於距離的檢索會導致性能下降。我們提出了AWARE,一個與任務對齊的檢索框架,使用監督式嵌入學習和輕量級適配器。在極端不平衡的情況下,AWARE將AUPRC提高了多達12.2%,並且隨著數據複雜性的增加而增長。我們的結果確定了檢索質量和檢索推理對齊是將表格內文學習應用於臨床預測的關鍵瓶頸。

Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models

2604.01840v1 by Zekai Ye, Qiming Li, Xiaocheng Feng, Ruihan Chen, Ziming Li, Haoyu Ren, Kun Chen, Dandan Tu, Bing Qin

While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks suffer from a foundational methodological flaw: by distributing identical advantages across all generated tokens, these methods inherently dilute the learning signals essential for optimizing the critical, visually-grounded steps of multimodal reasoning. To bridge this gap, we formulate Token Visual Dependency, quantifying the causal information gain of visual inputs via the Kullback-Leibler (KL) divergence between visual-conditioned and text-only predictive distributions. Revealing that this dependency is highly sparse and semantically pivotal, we introduce Perception-Grounded Policy Optimization (PGPO), which is a novel fine-grained credit assignment framework that dynamically reshapes advantages at the token level. Through a threshold-gated, mass-conserving mechanism, PGPO actively amplifies learning signals for visually-dependent tokens while suppressing gradient noise from linguistic priors. Extensive experiments based on the Qwen2.5-VL series across seven challenging multimodal reasoning benchmarks demonstrate that PGPO boosts models by 18.7% on average. Both theoretical and empirical analyses confirm that PGPO effectively reduces gradient variance, prevents training collapse, and acts as a potent regularizer for robust, perception-grounded multimodal reasoning. Code will be published at https://github.com/Yzk1114/PGPO.

摘要:雖然來自可驗證獎勵的強化學習(RLVR)在大型視覺語言模型(LVLMs)中推進了推理,但現有框架存在一個基本的方法論缺陷:通過在所有生成的標記中分配相同的優勢,這些方法本質上稀釋了對於優化多模態推理關鍵的、視覺基礎步驟所必需的學習信號。為了填補這一空白,我們制定了標記視覺依賴性,通過視覺條件和僅文本預測分佈之間的Kullback-Leibler(KL)散度來量化視覺輸入的因果信息增益。揭示這種依賴性高度稀疏且在語義上至關重要,我們引入了感知基礎的策略優化(PGPO),這是一種新穎的細粒度信用分配框架,能夠在標記層面動態重塑優勢。通過一種閾值門控的質量守恆機制,PGPO主動放大視覺依賴標記的學習信號,同時抑制來自語言先驗的梯度噪聲。基於Qwen2.5-VL系列的廣泛實驗,在七個具有挑戰性的多模態推理基準上顯示,PGPO平均提升模型性能18.7%。理論和實證分析均確認PGPO有效降低梯度方差,防止訓練崩潰,並作為一種強大的正則化器,促進穩健的、基於感知的多模態推理。代碼將在https://github.com/Yzk1114/PGPO上發布。
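The two ingredients named in this abstract, a per-token KL dependency score and a threshold-gated, mass-conserving reweighting of advantages, can be sketched as follows. The authors' code is not yet released, so the gating rule, the `threshold` and `boost` values, and the normalization below are assumptions chosen only to show the idea.

```python
import numpy as np

def token_visual_dependency(p_visual, p_text, eps=1e-12):
    """Per-token KL(p_visual || p_text) over the vocabulary axis."""
    p = np.clip(np.asarray(p_visual, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(p_text, dtype=float), eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

def reshape_advantages(advantage, dependency, threshold=0.1, boost=2.0):
    """Amplify tokens above the threshold, then rescale so the mean weight stays 1."""
    raw = np.where(dependency > threshold, boost, 1.0)  # hypothetical gate
    weights = raw * len(raw) / raw.sum()                # conserves total advantage mass
    return advantage * weights

# Three generated tokens over a toy 3-word vocabulary.
uniform = [1 / 3, 1 / 3, 1 / 3]
p_text = np.array([uniform, uniform, uniform])            # predictions without the image
p_visual = np.array([uniform, [0.9, 0.05, 0.05], uniform])  # the image changes only token 1

dependency = token_visual_dependency(p_visual, p_text)
reshaped = reshape_advantages(1.0, dependency)
print(dependency.round(3), reshaped)
```

Because the weights average to one, the sequence-level advantage total is unchanged; only its distribution across tokens shifts toward the visually dependent position.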

PLOT: Enhancing Preference Learning via Optimal Transport

2604.01837v1 by Liang Zhu, Yuelin Bai, Xiankun Ren, Jiaxi Yang, Lei Zhang, Feiteng Fang, Hamid Alinejad-Rokny, Minghuan Tan, Min Yang

Preference learning in Large Language Models (LLMs) has advanced significantly, yet existing methods remain limited by modest performance gains, high computational costs, hyperparameter sensitivity, and insufficient modeling of global token-level relationships. We introduce PLOT, which enhances Preference Learning in fine-tuning-based alignment through a token-level loss derived from Optimal Transport. By formulating preference learning as an Optimal Transport Problem, PLOT aligns model outputs with human preferences while preserving the original distribution of LLMs, ensuring stability and robustness. Furthermore, PLOT leverages token embeddings to capture semantic relationships, enabling globally informed optimization. Experiments across two preference categories - Human Values and Logic & Problem Solving - spanning seven subpreferences demonstrate that PLOT consistently improves alignment performance while maintaining fluency and coherence. These results substantiate optimal transport as a principled methodology for preference learning, establishing a theoretically grounded framework that provides new insights for preference learning of LLMs.

摘要:偏好學習在大型語言模型(LLMs)中已經取得了顯著進展,但現有的方法仍然受到有限的性能提升、高計算成本、超參數敏感性以及對全局標記級關係建模不足的限制。我們介紹了PLOT,它通過來自最優傳輸的標記級損失來增強基於微調的對齊中的偏好學習。通過將偏好學習公式化為最優傳輸問題,PLOT使模型輸出與人類偏好對齊,同時保留LLMs的原始分佈,確保穩定性和魯棒性。此外,PLOT利用標記嵌入來捕捉語義關係,實現全球信息的優化。在兩個偏好類別 - 人類價值觀和邏輯與問題解決 - 涉及七個子偏好的實驗中,顯示PLOT持續改善對齊性能,同時保持流暢性和一致性。這些結果證實了最優傳輸作為偏好學習的一種原則性方法,建立了一個理論基礎的框架,為LLMs的偏好學習提供了新的見解。
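To give a sense of what a transport cost over token embeddings looks like, here is a minimal entropic optimal transport sketch using Sinkhorn iterations. The abstract does not specify the solver, the cost function, or how the loss enters fine-tuning, so the squared-distance cost, the regularization strength `eps`, the uniform marginals, and the random embeddings are all illustrative assumptions.

```python
import numpy as np

def sinkhorn_plan(a, b, cost, eps=0.5, n_iters=200):
    """Entropic OT plan between histograms a and b for a given cost matrix."""
    K = np.exp(-cost / eps)  # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)      # match row marginals
        v = b / (K.T @ u)    # match column marginals
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
emb_model = rng.normal(size=(5, 4))  # stand-in embeddings of 5 generated tokens
emb_pref = rng.normal(size=(3, 4))   # stand-in embeddings of 3 preferred tokens

# Pairwise squared Euclidean distances, scaled to [0, 1] for numerical stability.
cost = ((emb_model[:, None, :] - emb_pref[None, :, :]) ** 2).sum(axis=-1)
cost = cost / cost.max()

a = np.full(5, 1 / 5)  # uniform mass on generated tokens
b = np.full(3, 1 / 3)  # uniform mass on preferred tokens
plan = sinkhorn_plan(a, b, cost)
ot_loss = float((plan * cost).sum())
print(plan.shape, round(ot_loss, 4))
```

The scalar `ot_loss` is the kind of quantity a token-level OT objective would penalize: it is small when every generated token has a semantically close counterpart among the preferred tokens.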

Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks

2604.01833v1 by Yaxin Luo, Zhiqiang Shen

The ratio of outlier parameters in language pre-training models and vision pre-training models differs significantly, making cross-modality (language and vision) inherently more challenging than cross-domain adaptation. As a result, many prior studies have focused on cross-domain transfer rather than attempting to bridge language and vision modalities, assuming that language pre-trained models are unsuitable for downstream visual tasks due to disparate parameter spaces. Contrary to this assumption, we show that adding a bridge training stage as a modality adaptation learner can effectively align Large Language Model (LLM) parameters with vision tasks. Specifically, we propose a simple yet powerful solution, random label bridge training, that requires no manual labeling and helps LLM parameters adapt to vision foundation tasks. Moreover, our findings reveal that partial bridge training is often advantageous, as certain layers in LLMs exhibit strong foundational properties that remain beneficial even without fine-tuning for visual tasks. This surprising discovery opens up new avenues for leveraging language pre-trained parameters directly within vision models and highlights the potential of partial bridge training as a practical pathway to cross-modality adaptation.

摘要:語言預訓練模型和視覺預訓練模型中的離群參數比例差異顯著,使得跨模態(語言和視覺)本質上比跨領域適應更具挑戰性。因此,許多先前的研究專注於跨領域轉移,而不是嘗試橋接語言和視覺模態,假設語言預訓練模型不適合下游視覺任務,因為參數空間存在差異。與這一假設相反,我們展示了將橋接訓練階段作為模態適應學習者添加進來,可以有效地將大型語言模型(LLM)的參數與視覺任務對齊。具體而言,我們提出了一個簡單而強大的解決方案——隨機標籤橋接訓練,該方法不需要手動標記,並幫助LLM參數適應視覺基礎任務。此外,我們的研究發現部分橋接訓練通常是有利的,因為LLM中的某些層展現出強大的基礎特性,即使在不進行視覺任務微調的情況下仍然有益。這一驚人的發現為直接在視覺模型中利用語言預訓練參數開辟了新的途徑,並突顯了部分橋接訓練作為跨模態適應的實用途徑的潛力。

Neural Network-Assisted Model Predictive Control for Implicit Balancing

2604.01805v1 by Seyed Soroush Karimi Madahi, Kenneth Bruninx, Bert Claessens, Chris Develder

In Europe, balance responsible parties can deliberately take out-of-balance positions to support transmission system operators (TSOs) in maintaining grid stability and earn profit, a practice called implicit balancing. Model predictive control (MPC) is widely adopted as an effective approach for implicit balancing. The balancing market model accuracy in MPC is critical to decision quality. Previous studies modeled this market using either (i) a convex market clearing approximation, ignoring proactive manual actions by TSOs and the market sub-quarter-hour dynamics, or (ii) machine learning methods, which cannot be directly integrated into MPC. To address these shortcomings, we propose a data-driven balancing market model integrated into MPC using an input convex neural network to ensure convexity while capturing uncertainties. To keep the core network computationally efficient, we incorporate attention-based input gating mechanisms to remove irrelevant data. Evaluating on Belgian data shows that the proposed model both improves MPC decisions and reduces computational time.

摘要:在歐洲,平衡負責方可以故意採取失衡的頭寸,以支持傳輸系統運營商(TSOs)維持電網穩定並獲取利潤,這種做法稱為隱式平衡。模型預測控制(MPC)被廣泛採用作為隱式平衡的有效方法。MPC中平衡市場模型的準確性對決策質量至關重要。先前的研究使用(i)凸市場清算近似來建模此市場,忽略了TSOs的主動手動行動及市場的子季度小時動態,或(ii)機器學習方法,這些方法無法直接整合到MPC中。為了應對這些不足,我們提出了一種數據驅動的平衡市場模型,該模型集成到MPC中,使用輸入凸神經網絡以確保凸性,同時捕捉不確定性。為了保持核心網絡的計算效率,我們納入了基於注意力的輸入閘控機制,以去除不相關數據。在比利時數據上的評估顯示,所提出的模型不僅改善了MPC決策,還減少了計算時間。
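The property that makes an input convex neural network usable inside an optimizer such as MPC is that its output is convex in its input. A minimal sketch of the standard construction (non-negative weights on the hidden-to-hidden path, ReLU activations, unconstrained skip connections from the input) is shown below. The layer sizes and random weights are arbitrary, and the attention-based input gating described in the abstract is omitted, so this is not the authors' model.

```python
import numpy as np

class TinyICNN:
    """Two-hidden-layer input convex network with random, untrained weights."""

    def __init__(self, in_dim, hidden, rng):
        self.Wx0 = rng.normal(size=(hidden, in_dim))
        self.b0 = rng.normal(size=hidden)
        # Weights applied to previous activations must be non-negative.
        self.Wz1 = np.abs(rng.normal(size=(hidden, hidden)))
        self.Wx1 = rng.normal(size=(hidden, in_dim))  # skip connection, unconstrained
        self.b1 = rng.normal(size=hidden)
        self.wz_out = np.abs(rng.normal(size=hidden))
        self.wx_out = rng.normal(size=in_dim)

    def __call__(self, x):
        z1 = np.maximum(self.Wx0 @ x + self.b0, 0.0)
        z2 = np.maximum(self.Wz1 @ z1 + self.Wx1 @ x + self.b1, 0.0)
        return float(self.wz_out @ z2 + self.wx_out @ x)

def jensen_gap(f, x, y, t):
    """f(t*x + (1-t)*y) - (t*f(x) + (1-t)*f(y)); never positive for convex f."""
    return f(t * x + (1 - t) * y) - (t * f(x) + (1 - t) * f(y))

rng = np.random.default_rng(0)
net = TinyICNN(in_dim=3, hidden=8, rng=rng)
gaps = [
    jensen_gap(net, rng.normal(size=3), rng.normal(size=3), rng.uniform())
    for _ in range(200)
]
max_gap = max(gaps)
print(f"largest Jensen gap over 200 random pairs: {max_gap:.2e}")
```

Each layer is a non-negative combination of convex functions plus an affine term, passed through a convex non-decreasing activation, so convexity is preserved layer by layer regardless of the weight values.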

DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment

2604.01787v1 by Liang Zhu, Feiteng Fang, Yuelin Bai, Longze Chen, Zhexiang Zhang, Minghuan Tan, Min Yang

Reinforcement Learning from Human Feedback (RLHF), using algorithms like Proximal Policy Optimization (PPO), aligns Large Language Models (LLMs) with human values but is costly and unstable. Alternatives have been proposed to replace PPO or integrate Supervised Fine-Tuning (SFT) and contrastive learning for direct fine-tuning and value alignment. However, these methods still require voluminous data to learn preferences and may weaken the generalization ability of LLMs. To further enhance alignment efficiency and performance while mitigating the loss of generalization ability, this paper introduces Distribution-guided Efficient Fine-Tuning (DEFT), an efficient alignment framework incorporating data filtering and distributional guidance by calculating the differential distribution reward based on the output distribution of the language model and the discrepancy distribution of preference data. A small yet high-quality subset is filtered from the raw data using a differential distribution reward, which is then incorporated into existing alignment methods to guide the model's output distribution. Experimental results demonstrate that the methods enhanced by DEFT outperform the original methods in both alignment capability and generalization ability, with significantly reduced training time.

摘要:強化學習來自人類反饋(RLHF),使用像是近端政策優化(PPO)這樣的算法,使大型語言模型(LLMs)與人類價值觀對齊,但成本高昂且不穩定。已提出替代方案來取代PPO或整合監督微調(SFT)和對比學習,以便進行直接微調和價值對齊。然而,這些方法仍然需要大量數據來學習偏好,並可能削弱LLMs的泛化能力。為了進一步提高對齊效率和性能,同時減少泛化能力的損失,本文介紹了基於分佈的高效微調(DEFT),這是一個高效的對齊框架,通過計算基於語言模型輸出分佈和偏好數據差異分佈的微分分佈獎勵來整合數據過濾和分佈指導。使用微分分佈獎勵從原始數據中過濾出一個小而高質量的子集,然後將其納入現有的對齊方法中,以指導模型的輸出分佈。實驗結果表明,經過DEFT增強的方法在對齊能力和泛化能力上均優於原始方法,並顯著減少了訓練時間。

Taming CATS: Controllable Automatic Text Simplification through Instruction Fine-Tuning with Control Tokens

2604.01779v1 by Hanna Hubarava, Yingqiang Gao

Controllable Automatic Text Simplification (CATS) produces user-tailored outputs, yet controllability is often treated as a decoding problem and evaluated with metrics that do not reflect the degree of control. We observe that controllability in ATS is significantly constrained by data and evaluation. To this end, we introduce a domain-agnostic CATS framework based on instruction fine-tuning with discrete control tokens, steering open-source models to target readability levels and compression rates. Across three model families with different model sizes (Llama, Mistral, Qwen; 1-14B) and four domains (medicine, public administration, news, encyclopedic text), we find that smaller models (1-3B) can be competitive, but reliable controllability strongly depends on whether the training data encodes sufficient variation in the target attribute. Readability control (FKGL, ARI, Dale-Chall) is learned consistently, whereas compression control underperforms due to limited signal variability in the existing corpora. We further show that standard simplification and similarity metrics are insufficient for measuring control, motivating error-based measures for target-output alignment. Finally, our sampling and stratification experiments demonstrate that naive splits can introduce distributional mismatch that undermines both training and evaluation.

摘要:可控自動文本簡化(CATS)產生用戶量身定制的輸出,但可控性通常被視為解碼問題,並用不反映控制程度的指標進行評估。我們觀察到,ATS中的可控性受到數據和評估的顯著限制。為此,我們介紹了一種基於指令微調和離散控制標記的領域無關的CATS框架,引導開源模型以達到目標可讀性水平和壓縮率。在三個不同模型大小的模型家族(Llama、Mistral、Qwen;1-14B)和四個領域(醫學、公共管理、新聞、百科文本)中,我們發現較小的模型(1-3B)可以具有競爭力,但可靠的可控性強烈依賴於訓練數據是否編碼了目標屬性的足夠變異性。可讀性控制(FKGL、ARI、Dale-Chall)學習得相當一致,而壓縮控制表現不佳,因為現有語料庫中的信號變異性有限。我們進一步顯示,標準簡化和相似性指標不足以衡量控制,這促使我們採用基於錯誤的指標來對齊目標輸出。最後,我們的抽樣和分層實驗表明,簡單的劃分可能會引入分佈不匹配,從而削弱訓練和評估的效果。

FSKD: Monocular Forest Structure Inference via LiDAR-to-RGBI Knowledge Distillation

2604.01766v1 by Taimur Khan, Hannes Feilhauer, Muhammad Jazib Zafar

Very High Resolution (VHR) forest structure data at individual-tree scale is essential for carbon, biodiversity, and ecosystem monitoring. Still, airborne LiDAR remains costly and infrequent despite being the reference for forest structure metrics like Canopy Height Model (CHM), Plant Area Index (PAI), and Foliage Height Diversity (FHD). We propose FSKD: a LiDAR-to-RGB-Infrared (RGBI) knowledge distillation (KD) framework in which a multi-modal teacher fuses RGBI imagery with LiDAR-derived planar metrics and vertical profiles via cross-attention, and an RGBI-only SegFormer student learns to reproduce these outputs. Trained on 384 $km^2$ of forests in Saxony, Germany (20 cm ground sampling distance (GSD)) and evaluated on eight geographically distinct test tiles, the student achieves state-of-the-art (SOTA) zero-shot CHM performance (MedAE 4.17 m, $R^2$=0.51, IoU 0.87), outperforming HRCHM/DAC baselines by 29--46% in MAE (5.81 m vs. 8.14--10.84 m) with stronger correlation coefficients (0.713 vs. 0.166--0.652). Ablations show that multi-modal fusion improves performance by 10--26% over RGBI-only training, and that asymmetric distillation with appropriate model capacity is critical. The method jointly predicts CHM, PAI, and FHD, a multi-metric capability not provided by current monocular CHM estimators, although PAI/FHD transfer remains region-dependent and benefits from local calibration. The framework also remains effective under temporal mismatch (winter LiDAR, summer RGBI), removing strict co-acquisition constraints and enabling scalable 20 cm operational monitoring for workflows such as Digital Twin Germany and national Digital Orthophoto programs.

摘要:非常高解析度 (VHR) 的森林結構數據在單棵樹的尺度上對於碳、生物多樣性和生態系統監測至關重要。儘管空中LiDAR仍然是森林結構指標(如樹冠高度模型 (CHM)、植物面積指數 (PAI) 和葉片高度多樣性 (FHD))的參考,但其成本高昂且使用頻率不高。我們提出了FSKD:一個LiDAR到RGB-紅外 (RGBI) 知識蒸餾 (KD) 框架,其中一個多模態教師通過交叉注意力將RGBI影像與LiDAR衍生的平面指標和垂直剖面融合,而僅使用RGBI的SegFormer學生則學習重現這些輸出。在德國薩克森州的384 $km^2$ 森林上進行訓練(地面取樣距離 (GSD) 為20厘米),並在八個地理上不同的測試區塊上進行評估,該學生在零樣本CHM性能上達到了最先進的 (SOTA) 表現(MedAE 4.17 m,$R^2$=0.51,IoU 0.87),在MAE方面超越了HRCHM/DAC基準29--46%(5.81 m對比8.14--10.84 m),並且具有更強的相關係數(0.713對比0.166--0.652)。消融實驗顯示,多模態融合在性能上比僅RGBI訓練提高了10--26%,而且具備適當模型容量的非對稱蒸餾是關鍵。該方法共同預測CHM、PAI和FHD,這是一種當前單目CHM估計器所不具備的多指標能力,儘管PAI/FHD的轉移仍然依賴於區域,並受益於本地校準。該框架在時間不匹配(冬季LiDAR,夏季RGBI)下仍然有效,消除了嚴格的共同獲取限制,並為數位雙胞胎德國和國家數位正射影像計畫等工作流程實現可擴展的20厘米操作監測。

DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning

2604.01765v1 by Yang Zhou, Xiaofeng Wang, Hao Shao, Letian Wang, Guosheng Zhao, Jiangnan Shao, Jiagang Zhu, Tingdong Yu, Zheng Zhu, Guan Huang, Steven L. Waslander

Recently, world-action models (WAM) have emerged to bridge vision-language-action (VLA) models and world models, unifying their reasoning and instruction-following capabilities and spatio-temporal world modeling. However, existing WAM approaches often focus on modeling 2D appearance or latent representations, with limited geometric grounding, an essential element for embodied systems operating in the physical world. We present DriveDreamer-Policy, a unified driving world-action model that integrates depth generation, future video generation, and motion planning within a single modular architecture. The model employs a large language model to process language instructions, multi-view images, and actions, followed by three lightweight generators that produce depth, future video, and actions. By learning a geometry-aware world representation and using it to guide both future prediction and planning within a unified framework, the proposed model produces more coherent imagined futures and more informed driving actions, while maintaining modularity and controllable latency. Experiments on the Navsim v1 and v2 benchmarks demonstrate that DriveDreamer-Policy achieves strong performance on both closed-loop planning and world generation tasks. In particular, our model reaches 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world-model-based approaches while producing higher-quality future video and depth predictions. Ablation studies further show that explicit depth learning provides complementary benefits to video imagination and improves planning robustness.

摘要:最近,世界行動模型(WAM)已經出現,以橋接視覺-語言-行動(VLA)模型和世界模型,統一它們的推理和遵循指令的能力以及時空世界建模。然而,現有的WAM方法往往專注於建模2D外觀或潛在表示,幾何基礎有限——這對於在物理世界中運作的具身系統來說是一個基本要素。我們提出了DriveDreamer-Policy,一個統一的駕駛世界行動模型,將深度生成、未來視頻生成和運動規劃整合在一個單一的模組化架構中。該模型使用大型語言模型來處理語言指令、多視角圖像和行動,隨後由三個輕量級生成器生成深度、未來視頻和行動。通過學習一個幾何感知的世界表示並利用它來指導統一框架內的未來預測和規劃,所提出的模型產生了更連貫的想像未來和更具信息性的駕駛行動,同時保持模組化和可控的延遲。在Navsim v1和v2基準上的實驗表明,DriveDreamer-Policy在閉環規劃和世界生成任務上都達到了強勁的表現。特別是,我們的模型在Navsim v1上達到了89.2 PDMS,在Navsim v2上達到了88.7 EPDMS,超越了現有的基於世界模型的方法,同時產生了更高質量的未來視頻和深度預測。消融研究進一步顯示,明確的深度學習為視頻想像提供了互補的好處,並提高了規劃的穩健性。

FourierMoE: Fourier Mixture-of-Experts Adaptation of Large Language Models

2604.01762v1 by Juyong Jiang, Fan Wang, Hong Qi, Sunghun Kim, Jing Tang

Parameter-efficient fine-tuning (PEFT) has emerged as a crucial paradigm for adapting large language models (LLMs) under constrained computational budgets. However, standard PEFT methods often struggle in multi-task fine-tuning settings, where diverse optimization objectives induce task interference and limited parameter budgets lead to representational deficiency. While recent approaches incorporate mixture-of-experts (MoE) to alleviate these issues, they predominantly operate in the spatial domain, which may introduce structural redundancy and parameter overhead. To overcome these limitations, we reformulate adaptation in the spectral domain. Our spectral analysis reveals that different tasks exhibit distinct frequency energy distributions, and that LLM layers display heterogeneous frequency sensitivities. Motivated by these insights, we propose FourierMoE, which integrates the MoE architecture with the inverse discrete Fourier transform (IDFT) for frequency-aware adaptation. Specifically, FourierMoE employs a frequency-adaptive router to dispatch tokens to experts specialized in distinct frequency bands. Each expert learns a set of conjugate-symmetric complex coefficients, preserving complete phase and amplitude information while theoretically guaranteeing lossless IDFT reconstruction into real-valued spatial weights. Extensive evaluations across 28 benchmarks, multiple model architectures, and scales demonstrate that FourierMoE consistently outperforms competitive baselines in both single-task and multi-task settings while using significantly fewer trainable parameters. These results highlight the promise of spectral-domain expert adaptation as an effective and parameter-efficient paradigm for LLM fine-tuning.

LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches

2604.01754v1 by Linyang He, Qiyao Yu, Hanze Dong, Baohao Liao, Xinxing Xu, Micah Goldblum, Jiang Bian, Nima Mesgarani

Mathematical reasoning is a hallmark of human intelligence, and whether large language models (LLMs) can meaningfully perform it remains a central question in artificial intelligence and cognitive science. As LLMs are increasingly integrated into scientific workflows, rigorous evaluation of their mathematical capabilities becomes a practical necessity. Existing benchmarks are limited by synthetic settings and data contamination. We present LiveMathematicianBench, a dynamic multiple-choice benchmark for research-level mathematical reasoning built from recent arXiv papers published after model training cutoffs. By grounding evaluation in newly published theorems, it provides a realistic testbed beyond memorized patterns. The benchmark introduces a thirteen-category logical taxonomy of theorem types (e.g., implication, equivalence, existence, uniqueness), enabling fine-grained evaluation across reasoning forms. It employs a proof-sketch-guided distractor pipeline that uses high-level proof strategies to construct plausible but invalid answer choices reflecting misleading proof directions, increasing sensitivity to genuine understanding over surface-level matching. We also introduce a substitution-resistant mechanism to distinguish answer recognition from substantive reasoning. Evaluation shows the benchmark is far from saturated: Gemini-3.1-pro-preview, the best model, achieves only 43.5%. Under substitution-resistant evaluation, accuracy drops sharply: GPT-5.4 scores highest at 30.6%, while Gemini-3.1-pro-preview falls to 17.6%, below the 20% random baseline. A dual-mode protocol reveals that proof-sketch access yields consistent accuracy gains, suggesting models can leverage high-level proof strategies for reasoning. Overall, LiveMathematicianBench offers a scalable, contamination-resistant testbed for studying research-level mathematical reasoning in LLMs.

Detecting Toxic Language: Ontology and BERT-based Approaches for Bulgarian Text

2604.01745v1 by Melania Berbatova, Tsvetoslav Vasev

Toxic content detection in online communication remains a significant challenge, with current solutions often inadvertently blocking valuable information, including medical terms and text related to minority groups. This paper presents a more nuanced approach to identifying toxicity in Bulgarian text while preserving access to essential information. The research explores two distinct methodologies for detecting toxic content. The developed methodologies have potential applications across diverse online platforms and content moderation systems. First, we propose an ontology that models the potentially toxic words in Bulgarian language. Then, we compose a dataset that comprises 4,384 manually annotated sentences from Bulgarian online forums across four categories: toxic language, medical terminology, non-toxic language, and terms related to minority communities. We then train a BERT-based model for toxic language classification, which reaches a 0.89 F1 macro score. The trained model is directly applicable in a real environment and can be integrated as a component of toxic content detection systems.

AeroTherm-GPT: A Verification-Centered LLM Framework for Thermal Protection System Engineering Workflows

2604.01738v1 by Chuhan Qiao, Jinglai Zheng, Jie Huang, Buyue Zhao, Fan Li, Haiming Huang

Integrating Large Language Models (LLMs) into hypersonic thermal protection system (TPS) design is bottlenecked by cascading constraint violations when generating executable simulation artifacts. General-purpose LLMs, treating generation as single-pass text completion, fail to satisfy the sequential, multi-gate constraints inherent in safety-critical engineering workflows. To address this, we propose AeroTherm-GPT, the first TPS-specialized LLM Agent, instantiated through a Constraint-Closed-Loop Generation (CCLG) framework. CCLG organizes TPS artifact generation as an iterative workflow comprising generation, validation, CDG-guided repair, execution, and audit. The Constraint Dependency Graph (CDG) encodes empirical co-resolution structure among constraint categories, directing repair toward upstream fault candidates based on lifecycle ordering priors and empirical co-resolution probabilities. This upstream-priority mechanism resolves multiple downstream violations per action, achieving a Root-Cause Fix Efficiency of 4.16 versus 1.76 for flat-checklist repair. Evaluated on HyTPS-Bench and validated against external benchmarks, AeroTherm-GPT achieves 88.7% End-to-End Success Rate (95% CI: 87.5-89.9), a gain of +12.5 pp over the matched non-CDG ablation baseline, without catastrophic forgetting on scientific reasoning and code generation tasks.
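
The upstream-priority idea can be illustrated with a small sketch. This is not the paper's CDG: the category names, the lifecycle order, the co-resolution probabilities, and the scoring rule (expected number of violations cleared, ties broken toward the upstream category) are all hypothetical stand-ins chosen to show how a single upstream repair might clear several downstream violations.

```python
# Hypothetical constraint categories in lifecycle order (upstream first).
LIFECYCLE = ["units", "geometry", "boundary", "solver"]

# Hypothetical co-resolution probabilities: CO_RESOLVE[u][d] is the assumed
# chance that repairing category u also clears a violation in category d.
CO_RESOLVE = {
    "units": {"geometry": 0.6, "boundary": 0.7, "solver": 0.5},
    "geometry": {"boundary": 0.8, "solver": 0.4},
    "boundary": {"solver": 0.9},
    "solver": {},
}

def pick_repair_target(violated):
    """Return the violated category expected to clear the most violations.

    Ties are broken in favor of the more upstream category.
    """
    def expected_cleared(u):
        # The repaired category itself counts as one cleared violation.
        return 1.0 + sum(CO_RESOLVE[u].get(d, 0.0) for d in violated if d != u)

    return max(violated, key=lambda u: (expected_cleared(u), -LIFECYCLE.index(u)))

target = pick_repair_target({"units", "boundary", "solver"})
```

A flat checklist would instead repair each violated category in turn, which is the baseline the abstract compares against.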

The AnIML Ontology: Enabling Semantic Interoperability for Large-Scale Experimental Data in Interconnected Scientific Labs

2604.01728v1 by Wilf Morlidge, Elliott Watkiss-Leek, George Hannah, Harry Rostron, Andrew Ng, Ewan Johnson, Andrew Mitchell, Terry R. Payne, Valentina Tamma, Jacopo de Berardinis

Achieving semantic interoperability across heterogeneous experimental data systems remains a major barrier to data-driven scientific discovery. The Analytical Information Markup Language (AnIML), a flexible XML-based standard for analytical chemistry and biology, is increasingly used in industrial R&D labs for managing and exchanging experimental data. However, the expressivity of the XML schema permits divergent interpretations across stakeholders, introducing inconsistencies that undermine the interoperability the AnIML schema was designed to support. In this paper, we present the AnIML Ontology, an OWL 2 ontology that formalises the semantics of AnIML and aligns it with the Allotrope Data Format to support future cross-system and cross-lab interoperability. The ontology was developed using an expert-in-the-loop approach combining LLM-assisted requirement elicitation with collaborative ontology engineering. We validate the ontology through a multi-layered approach: data-driven transformation of real-world AnIML files into knowledge graphs, competency question verification via SPARQL, and a novel validation protocol based on adversarial negative competency questions mapped to established ontological anti-patterns and enforced via SHACL constraints.

LiteInception: A Lightweight and Interpretable Deep Learning Framework for General Aviation Fault Diagnosis

2604.01725v1 by Zhihuan Wei, Xinhang Chen, Danyang Han, Yang Hu, Jie Liu, Xuewen Miao, Guijiang Li

General aviation fault diagnosis and efficient maintenance are critical to flight safety; however, deploying deep learning models on resource-constrained edge devices poses dual challenges in computational capacity and interpretability. This paper proposes LiteInception--a lightweight interpretable fault diagnosis framework designed for edge deployment. The framework adopts a two-stage cascaded architecture aligned with standard maintenance workflows: Stage 1 performs high-recall fault detection, and Stage 2 conducts fine-grained fault classification on anomalous samples, thereby decoupling optimization objectives and enabling on-demand allocation of computational resources. For model compression, a multi-method fusion strategy based on mutual information, gradient analysis, and SE attention weights is proposed to reduce the input sensor channels from 23 to 15, and a 1+1 branch LiteInception architecture is introduced that compresses InceptionTime parameters by 70%, accelerates CPU inference by over 8x, with less than 3% F1 loss. Furthermore, knowledge distillation is introduced as a precision-recall regulation mechanism, enabling the same lightweight model to adapt to different scenarios--such as safety-critical and auxiliary diagnosis--by switching training strategies. Finally, a dual-layer interpretability framework integrating four attribution methods is constructed, providing traceable evidence chains of "which sensor x which time period." Experiments on the NGAFID dataset demonstrate a fault detection accuracy of 81.92% with 83.24% recall, and a fault identification accuracy of 77.00%, validating the framework's favorable balance among efficiency, accuracy, and interpretability.
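
The two-stage cascade can be sketched in a few lines. This is an illustrative outline under stated assumptions, not the LiteInception code: `cascade`, `detect_score`, `classify`, the 0.3 threshold, and the toy scoring functions are hypothetical. It shows only the control flow, where a low Stage 1 threshold favors recall and Stage 2 runs solely on flagged samples.

```python
def cascade(samples, detect_score, classify, threshold=0.3):
    """Run Stage 1 detection on every sample and Stage 2 only on flagged ones.

    Returns the per-sample labels and the number of Stage 2 invocations.
    """
    results = []
    stage2_calls = 0
    for x in samples:
        if detect_score(x) >= threshold:  # low threshold favors recall
            stage2_calls += 1
            results.append(classify(x))   # fine-grained fault class
        else:
            results.append("normal")
    return results, stage2_calls

# Toy stand-ins: a sample is a single signed deviation value.
def toy_detect(x):
    return min(abs(x), 1.0)

def toy_classify(x):
    return "fault_A" if x > 0 else "fault_B"

labels, calls = cascade([0.05, 0.9, -0.6, 0.1], toy_detect, toy_classify)
```

Because Stage 2 is skipped for samples that Stage 1 passes as normal, its cost scales with the number of flagged samples rather than with the full stream.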

Causal Scene Narration with Runtime Safety Supervision for Vision-Language-Action Driving

2604.01723v1 by Yun Li, Yidu Zhang, Simon Thompson, Ehsan Javanmardi, Manabu Tsukada

Vision-Language-Action (VLA) models for autonomous driving must integrate diverse textual inputs, including navigation commands, hazard warnings, and traffic state descriptions, yet current systems often present these as disconnected fragments, forcing the model to discover on its own which environmental constraints are relevant to the current maneuver. We introduce Causal Scene Narration (CSN), which restructures VLA text inputs through intent-constraint alignment, quantitative grounding, and structured separation, at inference time with zero GPU cost. We complement CSN with Simplex-based runtime safety supervision and training-time alignment via Plackett-Luce DPO with negative log-likelihood (NLL) regularization. A multi-town closed-loop CARLA evaluation shows that CSN improves Driving Score by +31.1% on original LMDrive and +24.5% on the preference-aligned variant. A controlled ablation reveals that causal structure accounts for 39.1% of this gain, with the remainder attributable to information content alone. A perception noise ablation confirms that CSN's benefit is robust to realistic sensing errors. Semantic safety supervision improves Infraction Score, while reactive Time-To-Collision monitoring degrades performance, demonstrating that intent-aware monitoring is needed for VLA systems.

Transformer self-attention encoder-decoder with multimodal deep learning for response time series forecasting and digital twin support in wind structural health monitoring

2604.01712v1 by Feiyu Zhou, Marios Impraimakis

The wind-induced structural response forecasting capabilities of a novel transformer methodology are examined here. The model also provides a digital twin component for bridge structural health monitoring. Firstly, the approach uses the temporal characteristics of the system to train a forecasting model. Secondly, the vibration predictions are compared to the measured ones to detect large deviations. Finally, the identified cases are used as an early-warning indicator of structural change. The artificial intelligence-based model outperforms approaches for response forecasting as no assumption on wind stationarity or on structural normal vibration behavior is needed. Specifically, wind-excited dynamic behavior suffers from uncertainty related to obtaining poor predictions when the environmental or traffic conditions change. This results in a hard distinction of what constitutes normal vibration behavior. To this end, a framework is rigorously examined on real-world measurements from the Hardanger Bridge monitored by the Norwegian University of Science and Technology. The approach captures accurate structural behavior in realistic conditions, and with respect to the changes in the system excitation. The results, importantly, highlight the potential of transformer-based digital twin components to serve as next-generation tools for resilient infrastructure management, continuous learning, and adaptive monitoring over the system's lifecycle with respect to temporal characteristics.
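
The second and third steps (comparing predictions with measurements and treating large deviations as an early-warning signal) can be sketched as a residual check. The abstract does not state how "large" is defined, so the rule below (mean plus `k` standard deviations of residuals from a reference period) is an assumption, as are the function name `flag_deviations` and the toy numbers.

```python
import statistics

def flag_deviations(predicted, measured, ref_residuals, k=3.0):
    """Return indices where |measured - predicted| exceeds a reference limit.

    The limit is mean + k * std of absolute residuals from a reference period
    (an assumed rule, not taken from the paper).
    """
    limit = statistics.mean(ref_residuals) + k * statistics.pstdev(ref_residuals)
    return [
        i for i, (p, m) in enumerate(zip(predicted, measured))
        if abs(m - p) > limit
    ]

# Toy data: absolute residuals from a period assumed to be normal operation.
ref_residuals = [0.1, 0.2, 0.15, 0.05, 0.1]
predicted = [1.0, 1.2, 0.8, 1.1]
measured = [1.1, 1.15, 1.6, 1.0]

flagged = flag_deviations(predicted, measured, ref_residuals)
```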

Human-Guided Reasoning with Large Language Models for Vietnamese Speech Emotion Recognition

2604.01711v1 by Truc Nguyen, Then Tran, Binh Truong, Phuoc Nguyen T. H

Vietnamese Speech Emotion Recognition (SER) remains challenging due to ambiguous acoustic patterns and the lack of reliable annotated data, especially in real-world conditions where emotional boundaries are not clearly separable. To address this problem, this paper proposes a human-machine collaborative framework that integrates human knowledge into the learning process rather than relying solely on data-driven models. The proposed framework is centered around LLM-based reasoning, where acoustic feature-based models are used to provide auxiliary signals such as confidence and feature-level evidence. A confidence-based routing mechanism is introduced to distinguish between easy and ambiguous samples, allowing uncertain cases to be delegated to LLMs for deeper reasoning guided by structured rules derived from human annotation behavior. In addition, an iterative refinement strategy is employed to continuously improve system performance through error analysis and rule updates. Experiments are conducted on a Vietnamese speech dataset of 2,764 samples across three emotion classes (calm, angry, panic), with high inter-annotator agreement (Fleiss Kappa = 0.8574), ensuring reliable ground truth. The proposed method achieves strong performance, reaching up to 86.59% accuracy and Macro F1 around 0.85-0.86, demonstrating its effectiveness in handling ambiguous and hard-to-classify cases. Overall, this work highlights the importance of combining data-driven models with human reasoning, providing a robust and model-agnostic approach for speech emotion recognition in low-resource settings.
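
A confidence-based router of the kind described can be sketched as follows. Everything here is a stand-in: the 0.8 threshold, the `route` function, the toy acoustic model, and the stubbed reasoner with its single rule are hypothetical, and no real LLM is called. Only the three class labels come from the abstract.

```python
def route(sample, acoustic_model, llm_reasoner, threshold=0.8):
    """Keep confident acoustic predictions; delegate uncertain ones."""
    label, confidence, evidence = acoustic_model(sample)
    if confidence >= threshold:
        return label, "acoustic"
    return llm_reasoner(sample, label, confidence, evidence), "llm"

def toy_acoustic_model(sample):
    """Stand-in classifier: picks the top class and reports its score."""
    scores = sample["scores"]
    label = max(scores, key=scores.get)
    return label, scores[label], {"speech_rate": sample["speech_rate"]}

def toy_llm_reasoner(sample, label, confidence, evidence):
    """Stub for an LLM guided by annotation-derived rules (hypothetical rule)."""
    return "panic" if evidence["speech_rate"] > 5.0 else label

easy = {"scores": {"calm": 0.9, "angry": 0.06, "panic": 0.04}, "speech_rate": 3.0}
hard = {"scores": {"calm": 0.1, "angry": 0.5, "panic": 0.4}, "speech_rate": 6.2}

easy_result = route(easy, toy_acoustic_model, toy_llm_reasoner)
hard_result = route(hard, toy_acoustic_model, toy_llm_reasoner)
```

In the iterative refinement the abstract mentions, the rules inside the reasoner would be the part revised after each round of error analysis.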

Medical explainable AI

Publish Date Title Authors Homepage Code
2026-04-02 VISTA: Visualization of Token Attribution via Efficient Analysis Syed Ahmed et.al. 2604.02217v1 null
2026-04-02 Multi-Agent Video Recommenders: Evolution, Patterns, and Open Challenges Srivaths Ranganathan et.al. 2604.02211v1 null
2026-04-02 Tracking the emergence of linguistic structure in self-supervised models learning from speech Marianne de Heer Kloots et.al. 2604.02043v1 null
2026-04-02 Abnormal Head Movements in Neurological Conditions: A Knowledge-Based Dataset with Application to Cervical Dystonia Saja Al-Dabet et.al. 2604.01962v1 null
2026-04-02 Reliable News or Propagandist News? A Neurosymbolic Model Using Genre, Topic, and Persuasion Techniques to Improve Robustness in Classification Géraud Faye et.al. 2604.01936v1 null
2026-04-02 Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy Ruijie Yang et.al. 2604.01705v1 null
2026-04-02 Scale over Preference: The Impact of AI-Generated Content on Online Content Ecology Tianhao Shi et.al. 2604.01690v1 null
2026-04-02 Ontology-Aware Design Patterns for Clinical AI Systems: Translating Reification Theory into Software Architecture Florian Odi Stummer et.al. 2604.01661v1 null
2026-04-02 Do Large Language Models Mentalize When They Teach? Sevan K. Harootonian et.al. 2604.01594v1 null
2026-04-02 PHMForge: A Scenario-Driven Agentic Benchmark for Industrial Asset Lifecycle Maintenance Ayan Das et.al. 2604.01532v1 null
2026-04-01 Type-Checked Compliance: Deterministic Guardrails for Agentic Financial Systems Using Lean 4 Theorem Proving Devakh Rashie et.al. 2604.01483v1 null
2026-04-01 When AI Gets it Wong: Reliability and Risk in AI-Assisted Medication Decision Systems Khalid Adnan Alsayed et.al. 2604.01449v1 null
2026-04-01 Reproducible, Explainable, and Effective Evaluations of Agentic AI for Software Engineering Jingyue Li et.al. 2604.01437v1 null
2026-04-01 Semantic Modeling for World-Centered Architectures Andrei Mantsivoda et.al. 2604.01359v1 null
2026-04-01 Safety, Security, and Cognitive Risks in World Models Manoj Parmar et.al. 2604.01346v1 null
2026-04-01 The Recipe Matters More Than the Kitchen: Mathematical Foundations of the AI Weather Prediction Pipeline Piyush Garg et.al. 2604.01215v1 null
2026-04-01 The Overlooked Repetitive Lengthening Form in Sentiment Analysis Lei Wang et.al. 2604.01268v1 null
2026-04-01 PsychAgent: An Experience-Driven Lifelong Learning Agent for Self-Evolving Psychological Counselor Yutao Yang et.al. 2604.00931v2 null
2026-04-01 Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning Swapnil Parekh et.al. 2604.00770v1 null
2026-04-01 Towards Initialization-dependent and Non-vacuous Generalization Bounds for Overparameterized Shallow Neural Networks Yunwen Lei et.al. 2604.00505v1 null
2026-04-01 A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation Yabin Zhang et.al. 2604.00493v1 null
2026-04-01 Improving Generalization of Deep Learning for Brain Metastases Segmentation Across Institutions Yuchen Yang et.al. 2604.00397v1 null
2026-03-31 Collaborative AI Agents and Critics for Fault Detection and Cause Analysis in Network Telemetry Syed Eqbal Alam et.al. 2604.00319v1 null
2026-03-31 AI-Mediated Explainable Regulation for Justice Thomas Hofweber et.al. 2604.00237v1 null
2026-03-31 Explainable AI for Blind and Low-Vision Users: Navigating Trust, Modality, and Interpretability in the Agentic Era Abu Noman Md Sakib et.al. 2604.00187v1 null
2026-03-31 Physiological and Semantic Patterns in Medical Teams Using an Intelligent Tutoring System Xiaoshan Huang et.al. 2603.29950v1 null
2026-03-31 Uncertainty Gating for Cost-Aware Explainable Artificial Intelligence Georgii Mikriukov et.al. 2603.29915v1 null
2026-03-31 Reasoning-Driven Synthetic Data Generation and Evaluation Tim R. Davidson et.al. 2603.29791v1 null
2026-03-31 CausalPulse: An Industrial-Grade Neurosymbolic Multi-Agent Copilot for Causal Diagnostics in Smart Manufacturing Chathurangi Shyalika et.al. 2603.29755v1 null
2026-03-31 Symphony for Medical Coding: A Next-Generation Agentic System for Scalable and Explainable Medical Coding Joakim Edin et.al. 2603.29709v1 null
2026-03-31 Beyond the Steeper Curve: AI-Mediated Metacognitive Decoupling and the Limits of the Dunning-Kruger Metaphor Christopher Koch et.al. 2603.29681v1 null
2026-03-31 AI-Generated Prior Authorization Letters: Strong Clinical Content, Weak Administrative Scaffolding Moiz Sadiq Awan et.al. 2603.29366v1 null
2026-03-31 Rigorous Explanations for Tree Ensembles Yacine Izza et.al. 2603.29361v1 null
2026-03-31 Predicting Neuromodulation Outcome for Parkinson's Disease with Generative Virtual Brain Model Siyuan Du et.al. 2603.29176v1 null
2026-03-31 Knowledge database development by large language models for countermeasures against viruses and marine toxins Hung N. Do et.al. 2603.29149v1 null
2026-03-31 Towards Explainable Stakeholder-Aware Requirements Prioritisation in Aged-Care Digital Health Yuqing Xiao et.al. 2603.29114v1 null
2026-03-30 A Latent Risk-Aware Machine Learning Approach for Predicting Operational Success in Clinical Trials based on TrialsBank Iness Halimi et.al. 2603.29041v1 null
2026-03-30 Towards a Medical AI Scientist Hongtao Wu et.al. 2603.28589v1 null
2026-03-30 Detecting low left ventricular ejection fraction from ECG using an interpretable and scalable predictor-driven framework Ya Zhou et.al. 2603.28532v1 null
2026-03-30 RAD-LAD: Rule and Language Grounded Autonomous Driving in Real-Time Anurag Ghosh et.al. 2603.28522v2 null
2026-03-30 The Unreasonable Effectiveness of Scaling Laws in AI Chien-Ping Lu et.al. 2603.28507v1 null
2026-03-30 CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains Wenhan Wang et.al. 2603.28474v1 null
2026-03-30 The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation Doan Nam Long Vu et.al. 2603.28387v1 null
2026-03-30 Mapping data literacy trajectories in K-12 education Robert Whyte et.al. 2603.28317v1 null
2026-03-30 Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey Bhavuk Jain et.al. 2603.27918v1 null
2026-03-29 ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks Samin Mahdizadeh Sani et.al. 2603.27862v1 null
2026-03-29 What-If Explanations Over Time: Counterfactuals for Time Series Classification Udo Schlegel et.al. 2603.27792v1 null
2026-03-29 TianJi: An autonomous AI meteorologist for discovering physical mechanisms in atmospheric science Kaikai Zhang et.al. 2603.27738v1 null
2026-03-29 Multi-Agent Dialectical Refinement for Enhanced Argument Classification Jakub Bąba et.al. 2603.27451v1 null
2026-03-28 Culturally Adaptive Explainable LLM Assessment for Multilingual Information Disorder: A Human-in-the-Loop Approach Maziar Kianimoghadam Jouneghani et.al. 2603.27356v1 null
2026-03-28 Improving Automated Wound Assessment Using Joint Boundary Segmentation and Multi-Class Classification Models Mehedi Hasan Tusar et.al. 2603.27325v1 null
2026-03-28 MediHive: A Decentralized Agent Collective for Medical Reasoning Xiaoyang Wang et.al. 2603.27150v1 null
2026-03-28 Debiasing Large Language Models toward Social Factors in Online Behavior Analytics through Prompt Knowledge Tuning Hossein Salemi et.al. 2603.27057v1 null
2026-03-27 PRISMA: Toward a Normative Information Infrastructure for Responsible Pharmaceutical Knowledge Management Eugenio Rodrigo Zimmer Neves et.al. 2603.26324v1 null
2026-03-27 Sparse Auto-Encoders and Holism about Large Language Models Jumbly Grindrod et.al. 2603.26207v1 null
2026-03-27 Concerning Uncertainty -- A Systematic Survey of Uncertainty-Aware XAI Helena Löfström et.al. 2603.26838v1 null
2026-03-27 SkinGPT-X: A Self-Evolving Collaborative Multi-Agent System for Transparent and Trustworthy Dermatological Diagnosis Zhangtianyi Chen et.al. 2603.26122v1 null
2026-03-27 DPD-Cancer: Explainable Graph-based Deep Learning for Small Molecule Anti-Cancer Activity Prediction Magnus H. Strømme et.al. 2603.26114v1 null
2026-03-27 A Regression Framework for Understanding Prompt Component Impact on LLM Performance Andrew Lauziere et.al. 2603.26830v1 null
2026-03-27 FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants Mahesh Bhosale et.al. 2603.26008v1 null
2026-03-26 Do Neurons Dream of Primitive Operators? Wake-Sleep Compression Rediscovers Schank's Event Semantics Peter Balogh et.al. 2603.25975v1 null
2026-03-26 Methods for Knowledge Graph Construction from Text Collections: Development and Applications Vanni Zavarella et.al. 2603.25862v1 null
2026-03-26 A Compression Perspective on Simplicity Bias Tom Marty et.al. 2603.25839v1 null
2026-03-26 Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI Anna Kozlova et.al. 2603.25821v1 null
2026-03-26 DeepFAN, a transformer-based deep learning model for human-artificial intelligence collaborative assessment of incidental pulmonary nodules in CT scans: a multi-reader, multi-case trial Zhenchen Zhu et.al. 2603.25607v1 null
2026-03-26 From Manipulation to Mistrust: Explaining Diverse Micro-Video Misinformation for Robust Debunking in the Wild Zhi Zeng et.al. 2603.25423v1 null
2026-03-26 Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models Eyal Hadad et.al. 2603.25403v2 null
2026-03-26 4OPS: Structural Difficulty Modeling in Integer Arithmetic Puzzles Yunus E. Zeytuncu et.al. 2603.25356v1 null
2026-03-26 Evaluating Language Models for Harmful Manipulation Canfer Akbulut et.al. 2603.25326v2 null
2026-03-26 DAGverse: Building Document-Grounded Semantic DAGs from Scientific Papers Shu Wan et.al. 2603.25293v1 null
2026-03-26 Does Explanation Correctness Matter? Linking Computational XAI Evaluation to Human Understanding Gregor Baer et.al. 2603.25251v1 null
2026-03-26 Factors Influencing the Quality of AI-Generated Code: A Synthesis of Empirical Evidence Vehid Geruslu et.al. 2603.25146v1 null
2026-03-26 An Explainable Ensemble Learning Framework for Crop Classification with Optimized Feature Pyramids and Deep Networks Syed Rayhan Masud et.al. 2603.25070v1 null
2026-03-26 Efficient Detection of Bad Benchmark Items with Novel Scalability Coefficients Michael Hardy et.al. 2603.24999v2 null
2026-03-26 Rethinking Health Agents: From Siloed AI to Collaborative Decision Mediators Ray-Yuan Chung et.al. 2603.24986v1 null
2026-03-26 Self-Corrected Image Generation with Explainable Latent Rewards Yinyi Luo et.al. 2603.24965v1 null
2026-03-26 Can MLLMs Read Students' Minds? Unpacking Multimodal Error Analysis in Handwritten Math Dingjie Song et.al. 2603.24961v1 null
2026-03-26 Explaining, Verifying, and Aligning Semantic Hierarchies in Vision-Language Model Embeddings Gesina Schwalbe et.al. 2603.26798v1 null
2026-03-26 Sovereign AI at the Front Door of Care: A Physically Unidirectional Architecture for Secure Clinical Intelligence Vasu Srinivasan et.al. 2603.24898v1 null
2026-03-25 More Than "Means to an End": Supporting Reasoning with Transparently Designed AI Data Science Processes Venkatesh Sivaraman et.al. 2603.24877v1 null
2026-03-25 A Practical Guide Towards Interpreting Time-Series Deep Clinical Predictive Models: A Reproducibility Study Yongda Fan et.al. 2603.24828v1 null
2026-03-25 PhyDCM: A Reproducible Open-Source Framework for AI-Assisted Brain Tumor Classification from Multi-Sequence MRI Hayder Saad Abdulbaqi et.al. 2603.26794v1 null
2026-03-25 Dissecting Model Failures in Abdominal Aortic Aneurysm Segmentation through Explainability-Driven Analysis Abu Noman Md Sakib et.al. 2603.24801v1 null
2026-03-25 A Sociolinguistic Analysis of Automatic Speech Recognition Bias in Newcastle English Dana Serditova et.al. 2603.24549v1 null
2026-03-25 No Single Metric Tells the Whole Story: A Multi-Dimensional Evaluation Framework for Uncertainty Attributions Emily Schiller et.al. 2603.24524v1 null
2026-03-25 Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA John Ray B. Martinez et.al. 2603.24481v1 null
2026-03-25 Integrating Causal Machine Learning into Clinical Decision Support Systems: Insights from Literature and Practice Domenique Zipperling et.al. 2603.24448v1 null
2026-03-25 Enes Causal Discovery Alexis Kafantaris et.al. 2603.24436v3 null
2026-03-25 From Untamed Black Box to Interpretable Pedagogical Orchestration: The Ensemble of Specialized LLMs Architecture for Adaptive Tutoring Nizam Kadir et.al. 2603.23990v1 null
2026-03-25 Generative AI User Experience: Developing Human--AI Epistemic Partnership Xiaoming Zhai et.al. 2603.23863v1 null
2026-03-24 Causal AI For AMS Circuit Design: Interpretable Parameter Effects Analysis Mohyeu Hussain et.al. 2603.24618v1 null
2026-03-24 Learning What Can Be Picked: Active Reachability Estimation for Efficient Robotic Fruit Harvesting Nur Afsa Syeda et.al. 2603.23679v1 null
2026-03-24 Evaluating a Multi-Agent Voice-Enabled Smart Speaker for Care Homes: A Safety-Focused Framework Zeinab Dehghani et.al. 2603.23625v1 null
2026-03-24 AI Generalisation Gap In Comorbid Sleep Disorder Staging Saswata Bose et.al. 2603.23582v2 null
2026-03-24 A Learning Method with Gap-Aware Generation for Heterogeneous DAG Scheduling Ruisong Zhou et.al. 2603.23249v1 null
2026-03-24 Why AI-Generated Text Detection Fails: Evidence from Explainable AI Beyond Benchmark Accuracy Shushanta Pudasaini et.al. 2603.23146v1 null
2026-03-24 HUydra: Full-Range Lung CT Synthesis via Multiple HU Interval Generative Modelling António Cardoso et.al. 2603.23041v1 null
2026-03-24 Concept-based explanations of Segmentation and Detection models in Natural Disaster Management Samar Heydari et.al. 2603.23020v1 null
2026-03-24 On the use of Aggregation Operators to improve Human Identification using Dental Records Antonio D. Villegas-Yeguas et.al. 2603.23003v1 null
2026-03-24 Ran Score: a LLM-based Evaluation Score for Radiology Report Generation Ran Zhang et.al. 2603.22935v1 null

Abstracts

VISTA: Visualization of Token Attribution via Efficient Analysis

2604.02217v1 by Syed Ahmed, Bharathi Vokkaliga Ganesh, Jagadish Babu P, Karthick Selvaraj, Praneeth Talluri, Sanket Hingne, Anubhav Kumar, Anushka Yadav, Pratham Kumar Verma, Kiranmayee Janardhan, Mandanna A N

Understanding how Large Language Models (LLMs) process information from prompts remains a significant challenge. To shed light on this "black box," attention visualization techniques have been developed to capture neuron-level perceptions and interpret how models focus on different parts of input data. However, many existing techniques are tailored to specific model architectures, particularly within the Transformer family, and often require backpropagation, resulting in nearly double the GPU memory usage and increased computational cost. A lightweight, model-agnostic approach for attention visualization remains lacking. In this paper, we introduce a model-agnostic token importance visualization technique to better understand how generative AI systems perceive and prioritize information from input text, without incurring additional computational cost. Our method leverages perturbation-based strategies combined with a three-matrix analytical framework to generate relevance maps that illustrate token-level contributions to model predictions. The framework comprises: (1) the Angular Deviation Matrix, which captures shifts in semantic direction; (2) the Magnitude Deviation Matrix, which measures changes in semantic intensity; and (3) the Dimensional Importance Matrix, which evaluates contributions across individual vector dimensions. By systematically removing each token and measuring the resulting impact across these three complementary dimensions, we derive a composite importance score that provides a nuanced and mathematically grounded measure of token significance. To support reproducibility and foster wider adoption, we provide open-source implementations of all proposed and utilized explainability techniques, with code and resources publicly available at https://github.com/Infosys/Infosys-Responsible-AI-Toolkit

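Summary: Understanding how large language models (LLMs) process information from prompts remains a major challenge. To open up this "black box," attention visualization techniques have been developed to capture neuron-level perception and explain how models focus on different parts of the input data. However, many existing techniques are tailored to specific model architectures, particularly within the Transformer family, and usually require backpropagation, which leads to nearly double the GPU memory usage and higher computational cost. A lightweight, model-agnostic attention visualization method is lacking. In this paper, we introduce a model-agnostic token importance visualization technique to better understand how generative AI systems perceive and prioritize information from input text, without adding extra computational cost. Our method uses perturbation-based strategies combined with a three-matrix analytical framework to generate relevance maps that show each token's contribution to model predictions. The framework comprises: (1) the Angular Deviation Matrix, which captures changes in semantic direction; (2) the Magnitude Deviation Matrix, which measures changes in semantic intensity; and (3) the Dimensional Importance Matrix, which evaluates the contribution of individual vector dimensions. By systematically removing each token and measuring the resulting impact along these three complementary dimensions, we derive a composite importance score that provides a nuanced and mathematically grounded measure of token importance. To support reproducibility and encourage wider adoption, we provide open-source implementations of all proposed and used explainability techniques, with code and resources publicly available at https://github.com/Infosys/Infosys-Responsible-AI-Toolkit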
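
The leave-one-out procedure described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `embed` is a hypothetical stand-in (a deterministic hashed bag-of-words vector) for whatever model produces the text representation, the three per-token quantities are scalar simplifications of the three matrices, and the equal-weight sum used as the composite score is an assumption.

```python
import hashlib
import numpy as np

DIM = 16  # hypothetical embedding size, chosen only for the demo


def embed(tokens):
    """Toy stand-in for a model embedding: sum of per-token hashed vectors."""
    vec = np.zeros(DIM)
    for tok in tokens:
        seed = int.from_bytes(hashlib.sha256(tok.encode()).digest()[:4], "little")
        vec += np.random.default_rng(seed).standard_normal(DIM)
    return vec


def token_importance(tokens):
    """Remove each token in turn and score how far the embedding moves."""
    base = embed(tokens)
    base_norm = np.linalg.norm(base)
    scores = []
    for i in range(len(tokens)):
        pert = embed(tokens[:i] + tokens[i + 1:])
        pert_norm = np.linalg.norm(pert)
        cos = base @ pert / (base_norm * pert_norm + 1e-12)
        angular = np.arccos(np.clip(cos, -1.0, 1.0))   # shift in direction
        magnitude = abs(base_norm - pert_norm)          # shift in vector length
        dimensional = np.abs(base - pert).max()         # largest per-dimension change
        scores.append(angular + magnitude + dimensional)  # assumed equal weights
    scores = np.array(scores)
    return scores / scores.sum()  # normalise so the scores sum to 1


print(token_importance("the model ignores filler words".split()))
```

Because only forward passes on perturbed inputs are needed, a scheme of this shape avoids backpropagation, which is the property the abstract emphasises.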

Multi-Agent Video Recommenders: Evolution, Patterns, and Open Challenges

2604.02211v1 by Srivaths Ranganathan, Abhishek Dharmaratnakar, Anushree Sinha, Debanshu Das

Video recommender systems are among the most popular and impactful applications of AI, shaping content consumption and influencing culture for billions of users. Traditional single-model recommenders, which optimize static engagement metrics, are increasingly limited in addressing the dynamic requirements of modern platforms. In response, multi-agent architectures are redefining how video recommender systems serve, learn, and adapt to both users and datasets. These agent-based systems coordinate specialized agents responsible for video understanding, reasoning, memory, and feedback to provide precise, explainable recommendations. In this survey, we trace the evolution of multi-agent video recommendation systems (MAVRS). We combine ideas from multi-agent recommender systems, foundation models, and conversational AI, culminating in the emerging field of large language model (LLM)-powered MAVRS. We present a taxonomy of collaborative patterns and analyze coordination mechanisms across diverse video domains, ranging from short-form clips to educational platforms. We discuss representative frameworks, including early multi-agent reinforcement learning (MARL) systems such as MMRF and recent LLM-driven architectures like MACRec and Agent4Rec, to illustrate these patterns. We also outline open challenges in scalability, multimodal understanding, and incentive alignment, and identify research directions such as hybrid reinforcement learning-LLM systems, lifelong personalization, and self-improving recommender systems.

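Summary: Video recommender systems are among the most popular and influential applications of AI, shaping content consumption for billions of users and influencing culture. Traditional single-model recommenders, which optimize static engagement metrics, increasingly struggle to meet the dynamic needs of modern platforms. In response, multi-agent architectures are redefining how video recommender systems serve, learn from, and adapt to users and datasets. These agent-based systems coordinate specialized agents responsible for video understanding, reasoning, memory, and feedback in order to deliver precise and explainable recommendations. In this survey, we trace the evolution of multi-agent video recommendation systems (MAVRS). We combine ideas from multi-agent recommender systems, foundation models, and conversational AI, which culminate in the emerging field of large language model (LLM)-powered MAVRS. We propose a taxonomy of collaboration patterns and analyze coordination mechanisms across video domains ranging from short clips to educational platforms. We discuss representative frameworks, including early multi-agent reinforcement learning (MARL) systems such as MMRF and recent LLM-driven architectures such as MACRec and Agent4Rec, to illustrate these patterns. We also outline challenges in scalability, multimodal understanding, and incentive alignment, and identify research directions such as hybrid reinforcement learning-LLM systems, lifelong personalization, and self-improving recommender systems.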

Tracking the emergence of linguistic structure in self-supervised models learning from speech

2604.02043v1 by Marianne de Heer Kloots, Martijn Bentum, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem Zuidema

Self-supervised speech models learn effective representations of spoken language, which have been shown to reflect various aspects of linguistic structure. But when does such structure emerge in model training? We study the encoding of a wide range of linguistic structures, across layers and intermediate checkpoints of six Wav2Vec2 and HuBERT models trained on spoken Dutch. We find that different levels of linguistic structure show notably distinct layerwise patterns as well as learning trajectories, which can partially be explained by differences in their degree of abstraction from the acoustic signal and the timescale at which information from the input is integrated. Moreover, we find that the level at which pre-training objectives are defined strongly affects both the layerwise organization and the learning trajectories of linguistic structures, with greater parallelism induced by higher-order prediction tasks (i.e. iteratively refined pseudo-labels).

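Summary: Self-supervised speech models learn effective representations of spoken language, which have been shown to reflect various aspects of linguistic structure. But when does this structure emerge during model training? We study the encoding of a range of linguistic structures across the layers and intermediate checkpoints of six Wav2Vec2 and HuBERT models trained on spoken Dutch. We find that different levels of linguistic structure show markedly different layerwise patterns and learning trajectories, which can be explained in part by differences in their degree of abstraction from the acoustic signal and in the timescale over which information from the input is integrated. In addition, we find that the level at which the pre-training objective is defined strongly affects both the layerwise organization and the learning trajectories of linguistic structures, with higher-order prediction tasks (i.e., iteratively refined pseudo-labels) inducing greater parallelism.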

Abnormal Head Movements in Neurological Conditions: A Knowledge-Based Dataset with Application to Cervical Dystonia

2604.01962v1 by Saja Al-Dabet, Sherzod Turaev, Nazar Zaki

Abnormal head movements (AHMs) manifest across a broad spectrum of neurological disorders; however, the absence of a multi-condition resource integrating kinematic measurements, clinical severity scores, and patient demographics constitutes a persistent barrier to the development of AI-driven diagnostic tools. To address this gap, this study introduces NeuroPose-AHM, a knowledge-based dataset of neurologically induced AHMs constructed through a multi-LLM extraction framework applied to 1,430 peer-reviewed publications. The dataset contains 2,756 patient-group-level records spanning 57 neurological conditions, derived from 846 AHM-relevant papers. Inter-LLM reliability analysis confirms robust extraction performance, with study-level classification achieving strong agreement (kappa = 0.822). To demonstrate the dataset's analytical utility, a four-task framework is applied to cervical dystonia (CD), the condition most directly defined by pathological head movement. First, Task 1 performs multi-label AHM type classification (F1 = 0.856). Task 2 constructs the Head-Neck Severity Index (HNSI), a unified metric that normalizes heterogeneous clinical rating scales. The clinical relevance of this index is then evaluated in Task 3, where HNSI is validated against real-world CD patient data, with aligned severe-band proportions (6.7%) providing a preliminary plausibility indication for index calibration within the high severity range. Finally, Task 4 performs bridge analysis between movement-type probabilities and HNSI scores, producing significant correlations (p < 0.001). These results demonstrate the analytical utility of NeuroPose-AHM as a structured, knowledge-based resource for neurological AHM research. The NeuroPose-AHM dataset is publicly available on Zenodo (https://doi.org/10.5281/zenodo.19386862).

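Summary: Abnormal head movements (AHMs) appear across a wide range of neurological disorders; however, the lack of a multi-condition resource that integrates kinematic measurements, clinical severity scores, and patient demographics is a persistent barrier to developing AI-based diagnostic tools. To address this problem, this study introduces NeuroPose-AHM, a knowledge-based dataset of neurologically induced AHMs, built with a multi-LLM extraction framework applied to 1,430 peer-reviewed publications. The dataset contains 2,756 patient-group-level records covering 57 neurological conditions, drawn from 846 AHM-relevant papers. Inter-LLM reliability analysis confirms robust extraction performance, with study-level classification reaching strong agreement (kappa = 0.822). To demonstrate the dataset's analytical utility, a four-task framework is applied to cervical dystonia (CD), the condition most directly defined by pathological head movement. First, Task 1 performs multi-label AHM type classification (F1 = 0.856). Task 2 constructs the Head-Neck Severity Index (HNSI), a unified metric that normalizes heterogeneous clinical rating scales. Task 3 then evaluates the clinical relevance of this index by validating HNSI against real-world CD patient data, where aligned severe-band proportions (6.7%) give a preliminary plausibility indication for the index's calibration in the high-severity range. Finally, Task 4 performs a bridge analysis between movement-type probabilities and HNSI scores, yielding significant correlations (p < 0.001). These results demonstrate the analytical utility of NeuroPose-AHM as a structured, knowledge-based resource for neurological AHM research. The NeuroPose-AHM dataset is publicly available on Zenodo (https://doi.org/10.5281/zenodo.19386862).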
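
For readers unfamiliar with the agreement statistic quoted in this abstract, the sketch below computes Cohen's kappa for two annotators. It assumes the two-rater Cohen variant, which the abstract does not confirm, and the label lists are made-up placeholders rather than data from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labelled independently at their own rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Placeholder study-level decisions from two hypothetical LLM extractors.
llm_a = ["relevant", "relevant", "irrelevant", "relevant", "irrelevant", "irrelevant"]
llm_b = ["relevant", "relevant", "irrelevant", "irrelevant", "irrelevant", "irrelevant"]
kappa = cohen_kappa(llm_a, llm_b)
print(round(kappa, 3))
```

The function divides by `1 - expected`, so it is undefined when both raters always emit the same single label; a production version would need to handle that case.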

Reliable News or Propagandist News? A Neurosymbolic Model Using Genre, Topic, and Persuasion Techniques to Improve Robustness in Classification

2604.01936v1 by Géraud Faye, Benjamin Icard, Morgane Casanova, Guillaume Gadek, Guillaume Gravier, Wassila Ouerdane, Céline Hudelot, Sylvain Gatepaille, Paul Égré

Among news disorders, propagandist news is particularly insidious, because it tends to mix oriented messages with factual reports intended to look like reliable news. To detect propaganda, extant approaches based on Language Models such as BERT are promising but often overfit their training datasets, due to biases in data collection. To enhance classification robustness and improve generalization to new sources, we propose a neurosymbolic approach combining non-contextual text embeddings (fastText) with symbolic conceptual features such as genre, topic, and persuasion techniques. Results show improvements over equivalent text-only methods, and ablation studies as well as explainability analyses confirm the benefits of the added features. Keywords: Information disorder, Fake news, Propaganda, Classification, Topic modeling, Hybrid method, Neurosymbolic model, Ablation, Robustness

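Summary: Among news disorders, propagandist news is particularly insidious, because it tends to mix oriented messages with factual reports that look reliable. For propaganda detection, existing approaches based on language models such as BERT are promising, but they tend to overfit their training datasets because of biases in data collection. To strengthen classification robustness and improve generalization to new sources, we propose a neurosymbolic approach that combines non-contextual text embeddings (fastText) with symbolic conceptual features such as genre, topic, and persuasion techniques. Results show improvements over equivalent text-only methods, and ablation studies and explainability analyses confirm the benefits of the added features.
Keywords: information disorder, fake news, propaganda, classification, topic modeling, hybrid method, neurosymbolic model, ablation, robustness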

Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy

2604.01705v1 by Ruijie Yang, Yan Zhu, Peiyao Fu, Te Luo, Zhihua Wang, Xian Yang, Quanlin Li, Pinghong Zhou, Shuo Wang

Automatic speech recognition (ASR) is a critical interface for human-AI interaction in gastrointestinal endoscopy, yet its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic conditions. Here, we present EndoASR, a domain-adapted ASR system designed for real-time deployment in endoscopic workflows. We develop a two-stage adaptation strategy based on synthetic endoscopy reports, targeting domain-specific language modeling and noise robustness. In retrospective evaluation across six endoscopists, EndoASR substantially improves both transcription accuracy and clinical usability, reducing character error rate (CER) from 20.52% to 14.14% and increasing medical term accuracy (Med ACC) from 54.30% to 87.59%. In a prospective multi-center study spanning five independent endoscopy centers, EndoASR demonstrates consistent generalization under heterogeneous real-world conditions. Compared with the baseline Paraformer model, CER is reduced from 16.20% to 14.97%, while Med ACC is improved from 61.63% to 84.16%, confirming its robustness in practical deployment scenarios. Notably, EndoASR achieves a real-time factor (RTF) of 0.005, significantly faster than Whisper-large-v3 (RTF 0.055), while maintaining a compact model size of 220M parameters, enabling efficient edge deployment. Furthermore, integration with large language models demonstrates that improved ASR quality directly enhances downstream structured information extraction and clinician-AI interaction. These results demonstrate that domain-adapted ASR can serve as a reliable interface for human-AI teaming in gastrointestinal endoscopy, with consistent performance validated across multi-center real-world clinical settings.

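Summary: Automatic speech recognition (ASR) is a key interface for human-AI interaction, especially in gastrointestinal endoscopy, but its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic conditions. Here we present EndoASR, a domain-adapted ASR system designed for real-time deployment in endoscopic workflows. We develop a two-stage adaptation strategy based on synthetic endoscopy reports, targeting domain-specific language modeling and noise robustness. In a retrospective evaluation across six endoscopists, EndoASR substantially improves transcription accuracy and clinical usability, reducing the character error rate (CER) from 20.52% to 14.14% and raising medical term accuracy (Med ACC) from 54.30% to 87.59%. In a prospective multi-center study spanning five independent endoscopy centers, EndoASR shows consistent generalization under heterogeneous real-world conditions. Compared with the baseline Paraformer model, CER falls from 16.20% to 14.97% and Med ACC rises from 61.63% to 84.16%, confirming its robustness in practical deployment scenarios. Notably, EndoASR achieves a real-time factor (RTF) of 0.005, significantly faster than Whisper-large-v3 (RTF 0.055), while keeping a compact model size of 220M parameters, which enables efficient edge deployment. In addition, integration with large language models shows that better ASR quality directly improves downstream structured information extraction and clinician-AI interaction. These results indicate that domain-adapted ASR can serve as a reliable interface for human-AI teaming in gastrointestinal endoscopy, with consistent performance validated across multi-center real-world clinical settings.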
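
The character error rate quoted in this abstract is conventionally the character-level edit distance between reference and hypothesis, divided by the reference length. The sketch below assumes that standard definition (the abstract does not give the paper's exact scoring rules) and adds a helper for turning the reported before/after figures into relative reductions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edits needed per reference character."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(before, after):
    """Fraction by which an error rate dropped, relative to its old value."""
    return (before - after) / before


print(cer("polyp", "polip"))                       # one substitution in five characters
print(round(relative_reduction(20.52, 14.14), 3))  # retrospective CER figures
print(round(relative_reduction(16.20, 14.97), 3))  # prospective CER figures
```

On the abstract's numbers this gives a relative CER reduction of about 31% in the retrospective evaluation and about 7.6% in the prospective study, and the two real-time factors (0.055 versus 0.005) differ by a factor of 11.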

Scale over Preference: The Impact of AI-Generated Content on Online Content Ecology

2604.01690v1 by Tianhao Shi, Yang Zhang, Xiaoyan Zhao, Fengbin Zhu, Chenyi Lei, Han Li, Wenwu Ou, Yang Song, Yongdong Zhang, Fuli Feng

The rapid proliferation of Artificial Intelligence-Generated Content (AIGC) is fundamentally restructuring online content ecologies, necessitating a rigorous examination of its behavioral and distributional implications. Leveraging a comprehensive longitudinal dataset comprising tens of millions of users from a leading Chinese video-sharing platform, this study elucidates the distinct creation and consumption behaviors characterizing AIGC versus Human-Generated Content (HGC). We identify a prevalent scale-over-preference dynamic, wherein AIGC creators achieve aggregate engagement comparable to HGC creators through high-volume production, despite a marked consumer preference for HGC. Deeper analysis uncovers the ability of the algorithmic content distribution mechanism to moderate these competing interests regarding AIGC. These findings advocate for the implementation of AIGC-sensitive distribution algorithms and precise governance frameworks to ensure the long-term health of online content platforms.

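Summary: The rapid growth of AI-generated content (AIGC) is fundamentally reshaping online content ecosystems and calls for a rigorous examination of its behavioral and distributional effects. Using a comprehensive longitudinal dataset covering tens of millions of users of a leading Chinese video-sharing platform, this study clarifies the distinct creation and consumption behaviors of AIGC versus human-generated content (HGC). We identify a prevalent scale-over-preference dynamic, in which AIGC creators reach aggregate engagement comparable to that of HGC creators through high-volume production, despite a marked consumer preference for HGC. Deeper analysis reveals the ability of the algorithmic content distribution mechanism to moderate these competing interests around AIGC. These findings argue for AIGC-sensitive distribution algorithms and precise governance frameworks to ensure the long-term health of online content platforms.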

Ontology-Aware Design Patterns for Clinical AI Systems: Translating Reification Theory into Software Architecture

2604.01661v1 by Florian Odi Stummer

Clinical AI systems routinely train on health data structurally distorted by documentation workflows, billing incentives, and terminology fragmentation. Prior work has characterised the mechanisms of this distortion: the three-forces model of documentary enactment, the reification feedback loop through which AI may amplify coding artefacts, and terminology governance failures that allow semantic drift to accumulate. Yet translating these insights into implementable software architecture remains an open problem. This paper proposes seven ontology-aware design patterns in Gang-of-Four pattern language for building clinical AI pipelines resilient to ontological distortion. The patterns address data ingestion validation (Ontological Checkpoint), low-frequency signal preservation (Dormancy-Aware Pipeline), continuous drift monitoring (Drift Sentinel), parallel representation maintenance (Dual-Ontology Layer), feedback loop interruption (Reification Circuit Breaker), terminology evolution management (Terminology Version Gate), and pluggable regulatory compliance (Regulatory Compliance Adapter). Each pattern is specified with Problem, Forces, Solution, Consequences, Known Uses, and Related Patterns. We illustrate their composition in a reference architecture for a primary care AI system and provide a walkthrough tracing all seven patterns through a diabetes risk prediction scenario. This paper does not report empirical validation; it offers a design vocabulary grounded in theoretical analysis, subject to future evaluation in production systems. Three patterns have partial precedent in existing systems; the remaining four have not been formally described. Limitations include the absence of runtime benchmarks and restriction to the German and EU regulatory context.

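Summary: Clinical AI systems are routinely trained on health data that is structurally distorted by documentation workflows, billing incentives, and terminology fragmentation. Prior work has described the mechanisms of this distortion: the three-forces model of documentary enactment, the reification feedback loop through which AI may amplify coding artefacts, and terminology governance failures that allow semantic drift to accumulate. However, turning these insights into implementable software architecture remains an open problem. This paper proposes seven ontology-aware design patterns, written in Gang-of-Four pattern language, for building clinical AI pipelines that are resilient to ontological distortion. The patterns address data ingestion validation (Ontological Checkpoint), low-frequency signal preservation (Dormancy-Aware Pipeline), continuous drift monitoring (Drift Sentinel), parallel representation maintenance (Dual-Ontology Layer), feedback loop interruption (Reification Circuit Breaker), terminology evolution management (Terminology Version Gate), and pluggable regulatory compliance (Regulatory Compliance Adapter). Each pattern is specified with Problem, Forces, Solution, Consequences, Known Uses, and Related Patterns. We show their composition in a reference architecture for a primary care AI system and provide a walkthrough that traces all seven patterns through a diabetes risk prediction scenario. This paper does not report empirical validation; it offers a design vocabulary grounded in theoretical analysis, to be evaluated in production systems in the future. Three patterns have partial precedent in existing systems; the remaining four have not been formally described. Limitations include the lack of runtime benchmarks and a restriction to the German and EU regulatory context.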

Do Large Language Models Mentalize When They Teach?

2604.01594v1 by Sevan K. Harootonian, Mark K. Ho, Thomas L. Griffiths, Yael Niv, Ilia Sucholutsky

How do LLMs decide what to teach next: by reasoning about a learner's knowledge, or by using simpler rules of thumb? We test this in a controlled task previously used to study human teaching strategies. On each trial, a teacher LLM sees a hypothetical learner's trajectory through a reward-annotated directed graph and must reveal a single edge so the learner would choose a better path if they replanned. We run a range of LLMs as simulated teachers and fit their trial-by-trial choices with the same cognitive models used for humans: a Bayes-Optimal teacher that infers which transitions the learner is missing (inverse planning), weaker Bayesian variants, heuristic baselines (e.g., reward based), and non-mentalizing utility models. In a baseline experiment matched to the stimuli presented to human subjects, most LLMs perform well, show little change in strategy over trials, and their graph-by-graph performance is similar to that of humans. Model comparison (BIC) shows that Bayes-Optimal teaching best explains most models' choices. When given a scaffolding intervention, models follow auxiliary inference- or reward-focused prompts, but these scaffolds do not reliably improve later teaching on heuristic-incongruent test graphs and can sometimes reduce performance. Overall, cognitive model fits provide insight into LLM tutoring policies and show that prompt compliance does not guarantee better teaching decisions.

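Summary: How do LLMs decide what to teach next: by reasoning about the learner's knowledge, or by using simpler rules of thumb? We test this in a controlled task previously used to study human teaching strategies. On each trial, a teacher LLM sees a hypothetical learner's trajectory through a reward-annotated directed graph and must reveal one edge so that the learner would choose a better path when replanning. We run a range of LLMs as simulated teachers and fit their trial-by-trial choices with the same cognitive models used for humans: a Bayes-Optimal teacher that infers which transitions the learner is missing (inverse planning), weaker Bayesian variants, heuristic baselines (e.g., reward-based), and non-mentalizing utility models. In a baseline experiment matched to the stimuli shown to human subjects, most LLMs perform well, change strategy little over trials, and show graph-by-graph performance similar to that of humans. Model comparison (BIC) shows that Bayes-Optimal teaching best explains most models' choices. When given a scaffolding intervention, models follow auxiliary inference-focused or reward-focused prompts, but these scaffolds do not reliably improve later teaching on heuristic-incongruent test graphs and sometimes even reduce performance. Overall, cognitive model fits provide insight into LLM tutoring policies and show that prompt compliance does not guarantee better teaching decisions.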
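
The model comparison mentioned in this abstract relies on the Bayesian Information Criterion, BIC = k ln(n) - 2 ln(L), where lower is better. The sketch below shows how such a comparison could be run for one simulated teacher; the model names echo the abstract, but the log-likelihoods, parameter counts, and trial count are invented placeholders, not results from the paper.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian Information Criterion: fit penalised by model complexity."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


# Placeholder fits for one hypothetical teacher LLM (values are made up).
n_trials = 100
fits = {
    "bayes_optimal": {"log_likelihood": -95.0, "n_params": 1},
    "reward_heuristic": {"log_likelihood": -120.0, "n_params": 1},
    "utility_model": {"log_likelihood": -93.0, "n_params": 3},
}

scores = {
    name: bic(fit["log_likelihood"], fit["n_params"], n_trials)
    for name, fit in fits.items()
}
best = min(scores, key=scores.get)  # lowest BIC wins
print(best, {name: round(value, 1) for name, value in scores.items()})
```

In this toy setup the three-parameter model fits slightly better in raw likelihood but loses once the complexity penalty is applied, which is the trade-off BIC is meant to capture.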

PHMForge: A Scenario-Driven Agentic Benchmark for Industrial Asset Lifecycle Maintenance

2604.01532v1 by Ayan Das, Dhaval Patel

Large language model (LLM) agents are increasingly deployed for complex tool-orchestration tasks, yet existing benchmarks fail to capture the rigorous demands of industrial domains where incorrect decisions carry significant safety and financial consequences. To address this critical gap, we introduce PHMForge, the first comprehensive benchmark specifically designed to evaluate LLM agents on Prognostics and Health Management (PHM) tasks through realistic interactions with domain-specific MCP servers. Our benchmark encompasses 75 expert-curated scenarios spanning 7 industrial asset classes (turbofan engines, bearings, electric motors, gearboxes, aero-engines) across 5 core task categories: Remaining Useful Life (RUL) Prediction, Fault Classification, Engine Health Analysis, Cost-Benefit Analysis, and Safety/Policy Evaluation. To enable rigorous evaluation, we construct 65 specialized tools across two MCP servers and implement execution-based evaluators with task-commensurate metrics: MAE/RMSE for regression, F1-score for classification, and categorical matching for health assessments. Through extensive evaluation of leading frameworks (ReAct, Cursor Agent, Claude Code) paired with frontier LLMs (Claude Sonnet 4.0, GPT-4o, Granite-3.0-8B), we find that even top-performing configurations achieve only 68% task completion, with systematic failures in tool orchestration (23% incorrect sequencing), multi-asset reasoning (14.9 percentage point degradation), and cross-equipment generalization (42.7% on held-out datasets). We open-source our complete benchmark, including scenario specifications, ground truth templates, tool implementations, and evaluation scripts, to catalyze research in agentic industrial AI.

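Summary: Large language model (LLM) agents are increasingly deployed for complex tool-orchestration tasks, yet existing benchmarks fail to capture the strict demands of industrial domains, where incorrect decisions carry significant safety and financial consequences. To close this critical gap, we introduce PHMForge, the first comprehensive benchmark designed specifically to evaluate LLM agents on Prognostics and Health Management (PHM) tasks through realistic interactions with domain-specific MCP servers. Our benchmark covers 75 expert-curated scenarios spanning 7 industrial asset classes (turbofan engines, bearings, electric motors, gearboxes, aero-engines) and 5 core task categories: Remaining Useful Life (RUL) Prediction, Fault Classification, Engine Health Analysis, Cost-Benefit Analysis, and Safety/Policy Evaluation. To enable rigorous evaluation, we build 65 specialized tools across two MCP servers and implement execution-based evaluators with task-appropriate metrics: MAE/RMSE for regression, F1-score for classification, and categorical matching for health assessments. Through extensive evaluation of leading frameworks (ReAct, Cursor Agent, Claude Code) paired with frontier LLMs (Claude Sonnet 4.0, GPT-4o, Granite-3.0-8B), we find that even the best-performing configurations reach only 68% task completion, with systematic failures in tool orchestration (23% incorrect sequencing), multi-asset reasoning (a 14.9 percentage point drop), and cross-equipment generalization (42.7% on held-out datasets). We open-source the complete benchmark, including scenario specifications, ground truth templates, tool implementations, and evaluation scripts, to advance research in agentic industrial AI.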
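
The task-commensurate metrics named in this abstract are standard ones. As a rough sketch of what an execution-based evaluator might compute, the functions below implement MAE and RMSE for RUL-style regression outputs and a single-class F1 for fault labels; the function names and sample values are illustrative assumptions, not taken from the benchmark's code.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error, e.g. for remaining-useful-life predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error; penalises large misses more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def f1_score(y_true, y_pred, positive):
    """F1 for one class, from true positives, false positives, false negatives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up agent outputs: RUL in cycles, then fault labels.
print(mae([100, 80, 60], [90, 85, 60]), rmse([100, 80, 60], [90, 85, 60]))
print(f1_score(["fault", "ok", "fault", "fault"],
               ["fault", "fault", "ok", "fault"], positive="fault"))
```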

Type-Checked Compliance: Deterministic Guardrails for Agentic Financial Systems Using Lean 4 Theorem Proving

2604.01483v1 by Devakh Rashie, Veda Rashi

The rapid evolution of autonomous, agentic artificial intelligence within financial services has introduced an existential architectural crisis: large language models (LLMs) are probabilistic, non-deterministic systems operating in domains that demand absolute, mathematically verifiable compliance guarantees. Existing guardrail solutions -- including NVIDIA NeMo Guardrails and Guardrails AI -- rely on probabilistic classifiers and syntactic validators that are fundamentally inadequate for enforcing complex multi-variable regulatory constraints mandated by the SEC, FINRA, and OCC. This paper presents the Lean-Agent Protocol, a formal-verification-based AI guardrail platform that leverages the Aristotle neural-symbolic model developed by Harmonic AI to auto-formalize institutional policies into Lean 4 code. Every proposed agentic action is treated as a mathematical conjecture: execution is permitted if and only if the Lean 4 kernel proves that the action satisfies pre-compiled regulatory axioms. This architecture provides cryptographic-level compliance certainty at microsecond latency, directly satisfying SEC Rule 15c3-5, OCC Bulletin 2011-12, FINRA Rule 3110, and CFPB explainability mandates. A three-phase implementation roadmap from shadow verification through enterprise-scale deployment is provided.

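Summary: The rapid evolution of autonomous, agentic AI in financial services has triggered an existential architectural crisis: large language models (LLMs) are probabilistic, non-deterministic systems operating in domains that demand absolute, mathematically verifiable compliance guarantees. Existing guardrail solutions, including NVIDIA NeMo Guardrails and Guardrails AI, rely on probabilistic classifiers and syntactic validators that are fundamentally inadequate for enforcing the complex multi-variable regulatory constraints required by the SEC, FINRA, and OCC. This paper presents the Lean-Agent Protocol, a formal-verification-based AI guardrail platform that uses the Aristotle neural-symbolic model developed by Harmonic AI to auto-formalize institutional policies into Lean 4 code. Every proposed agentic action is treated as a mathematical conjecture: execution is permitted only when the Lean 4 kernel proves that the action satisfies pre-compiled regulatory axioms. This architecture provides cryptographic-level compliance certainty at microsecond latency, directly satisfying SEC Rule 15c3-5, OCC Bulletin 2011-12, FINRA Rule 3110, and CFPB explainability requirements. A three-phase implementation roadmap from shadow verification to enterprise-scale deployment is provided.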
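
To make the "action as conjecture" idea in this abstract concrete, here is a small Lean 4 sketch. It is not the Lean-Agent Protocol itself: the `Order` structure, its fields, the `Compliant` predicate, and the `execute` gate are all hypothetical names chosen for illustration, and the single credit-limit rule is a toy stand-in for real regulatory axioms.

```lean
-- Hypothetical order record; field names are illustrative only.
structure Order where
  notional    : Nat   -- order value, in cents
  creditLimit : Nat   -- pre-set credit threshold, in cents

-- A toy policy: an order may not exceed its credit limit.
def Compliant (o : Order) : Prop :=
  o.notional ≤ o.creditLimit

-- Make the policy mechanically decidable so the kernel can check it.
instance (o : Order) : Decidable (Compliant o) :=
  inferInstanceAs (Decidable (o.notional ≤ o.creditLimit))

-- Execution gate: the action can only be invoked together with a proof.
def execute (o : Order) (_proof : Compliant o) : String :=
  s!"executed order of {o.notional}"

-- The proof obligation for a concrete action, discharged by `decide`.
example : Compliant { notional := 500, creditLimit := 1000 } := by decide
```

The design point is that `execute` cannot be called without a term of type `Compliant o`, so a non-compliant order fails at type-checking time rather than being filtered by a classifier afterwards.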

When AI Gets it Wrong: Reliability and Risk in AI-Assisted Medication Decision Systems

2604.01449v1 by Khalid Adnan Alsayed

Artificial intelligence (AI) systems are increasingly integrated into healthcare and pharmacy workflows, supporting tasks such as medication recommendations, dosage determination, and drug interaction detection. While these systems often demonstrate strong performance under standard evaluation metrics, their reliability in real-world decision-making remains insufficiently understood. In high-risk domains such as medication management, even a single incorrect recommendation can result in severe patient harm. This paper examines the reliability of AI-assisted medication systems by focusing on system failures and their potential clinical consequences. Rather than evaluating performance solely through aggregate metrics, this work shifts attention towards how errors occur and what happens when AI systems produce incorrect outputs. Through a series of controlled, simulated scenarios involving drug interactions and dosage decisions, we analyse different types of system failures, including missed interactions, incorrect risk flagging, and inappropriate dosage recommendations. The findings highlight that AI errors in medication-related contexts can lead to adverse drug reactions, ineffective treatment, or delayed care, particularly when systems are used without sufficient human oversight. Furthermore, the paper discusses the risks of over-reliance on AI recommendations and the challenges posed by limited transparency in decision-making processes. This work contributes a reliability-focused perspective on AI evaluation in healthcare, emphasising the importance of understanding failure behavior and real-world impact. It highlights the need to complement traditional performance metrics with risk-aware evaluation approaches, particularly in safety-critical domains such as pharmacy practice.

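Summary: Artificial intelligence (AI) systems are increasingly integrated into healthcare and pharmacy workflows, supporting tasks such as medication recommendation, dosage determination, and drug interaction detection. Although these systems usually perform well on standard evaluation metrics, their reliability in real-world decision-making is still not well understood. In high-risk domains such as medication management, even a single incorrect recommendation can cause serious patient harm. This paper examines the reliability of AI-assisted medication systems by focusing on system failures and their potential clinical consequences. Rather than assessing performance through aggregate metrics alone, this work turns to how errors occur and what happens when AI systems produce incorrect outputs. Through a series of controlled, simulated scenarios involving drug interactions and dosage decisions, we analyse different types of system failure, including missed interactions, incorrect risk flagging, and inappropriate dosage recommendations. The findings highlight that AI errors in medication-related contexts can lead to adverse drug reactions, ineffective treatment, or delayed care, particularly when systems are used without sufficient human oversight. The paper also discusses the risk of over-reliance on AI recommendations and the challenges posed by limited transparency in decision-making. This work offers a reliability-focused perspective on AI evaluation in healthcare, emphasising the importance of understanding failure behavior and real-world impact. It highlights the need to combine traditional performance metrics with risk-aware evaluation approaches in safety-critical domains such as pharmacy practice.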

Reproducible, Explainable, and Effective Evaluations of Agentic AI for Software Engineering

2604.01437v1 by Jingyue Li, André Storhaug

With the advancement of Agentic AI, researchers are increasingly leveraging autonomous agents to address challenges in software engineering (SE). However, the large language models (LLMs) that underpin these agents often function as black boxes, making it difficult to justify the superiority of Agentic AI approaches over baselines. Furthermore, missing information in the evaluation design description frequently renders the reproduction of results infeasible. To synthesize current evaluation practices for Agentic AI in SE, this study analyzes 18 papers on the topic, published or accepted by ICSE 2026, ICSE 2025, FSE 2025, ASE 2025, and ISSTA 2025. The analysis identifies prevailing approaches and their limitations in evaluating Agentic AI for SE, both in current research and potential future studies. To address these shortcomings, this position paper proposes a set of guidelines and recommendations designed to empower reproducible, explainable, and effective evaluations of Agentic AI in software engineering. In particular, we recommend that Agentic AI researchers make their Thought-Action-Result (TAR) trajectories and LLM interaction data, or summarized versions of these artifacts, publicly accessible. Doing so will enable subsequent studies to more effectively analyze the strengths and weaknesses of different Agentic AI approaches. To demonstrate the feasibility of such comparisons, we present a proof-of-concept case study that illustrates how TAR trajectories can support systematic analysis across approaches.

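Summary: With the advancement of Agentic AI, researchers increasingly use autonomous agents to address challenges in software engineering (SE). However, the large language models (LLMs) that underpin these agents usually operate as black boxes, which makes it hard to justify the superiority of Agentic AI approaches over baselines. In addition, missing information in descriptions of the evaluation design often makes results impossible to reproduce. To synthesize current evaluation practices for Agentic AI in SE, this study analyzes 18 papers on the topic that were accepted or published by ICSE 2026, ICSE 2025, FSE 2025, ASE 2025, and ISSTA 2025. The analysis identifies the prevailing approaches to evaluating Agentic AI, and their limitations, in current research and potential future studies. To address these shortcomings, this position paper proposes a set of guidelines and recommendations intended to promote reproducible, explainable, and effective evaluations of Agentic AI in software engineering. In particular, we recommend that Agentic AI researchers make their Thought-Action-Result (TAR) trajectories and LLM interaction data, or summarized versions of these artifacts, publicly available. Doing so will allow later studies to analyze the strengths and weaknesses of different Agentic AI approaches more effectively. To demonstrate the feasibility of such comparisons, we present a proof-of-concept case study showing how TAR trajectories can support systematic analysis across approaches.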
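
One way to picture the recommendation in this abstract about sharing Thought-Action-Result trajectories is as a small, serialisable record per agent step. The schema below is purely an assumption for illustration (the paper's own format is not given in the abstract); it stores one JSON object per line so that trajectories from different approaches can be diffed and aggregated.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TARStep:
    """One step of an agent run; field names are hypothetical."""
    thought: str   # the agent's stated reasoning
    action: str    # the tool call or edit it chose
    result: str    # what the environment returned


def to_jsonl(steps):
    """Serialise a trajectory as JSON Lines, one step per line."""
    return "\n".join(json.dumps(asdict(step)) for step in steps)


def from_jsonl(text):
    """Load a trajectory back from JSON Lines."""
    return [TARStep(**json.loads(line)) for line in text.splitlines() if line]


trajectory = [
    TARStep("The failing test points at the parser.", "open parser.py", "file contents shown"),
    TARStep("The off-by-one is in the loop bound.", "edit parser.py", "tests pass"),
]
dump = to_jsonl(trajectory)
print(dump)
```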

Semantic Modeling for World-Centered Architectures

2604.01359v1 by Andrei Mantsivoda, Darya Gavrilina

We introduce world-centered multi-agent systems (WMAS) as an alternative to traditional agent-centered architectures, arguing that structured domains such as enterprises and institutional systems require a shared, explicit world representation to ensure semantic consistency, explainability, and long-term stability. We classify worlds along dimensions including ontological explicitness, normativity, etc. In WMAS, learning and coordination operate over a shared world model rather than isolated agent-local representations, enabling global consistency and verifiable system behavior. We propose semantic models as a mathematical formalism for representing such worlds. Finally, we present the Ontobox platform as a realization of WMAS.

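Summary: We introduce world-centered multi-agent systems (WMAS) as an alternative to traditional agent-centered architectures, and argue that structured domains such as enterprises and institutional systems need a shared, explicit world representation to ensure semantic consistency, explainability, and long-term stability. We classify worlds along dimensions such as ontological explicitness and normativity. In WMAS, learning and coordination operate over a shared world model rather than isolated agent-local representations, which makes global consistency and verifiable system behavior possible. We propose semantic models as a mathematical formalism for representing such worlds. Finally, we present the Ontobox platform as a realization of WMAS.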

Safety, Security, and Cognitive Risks in World Models

2604.01346v1 by Manoj Parmar

World models -- learned internal simulators of environment dynamics -- are rapidly becoming foundational to autonomous decision-making in robotics, autonomous vehicles, and agentic AI. Yet this predictive power introduces a distinctive set of safety, security, and cognitive risks. Adversaries can corrupt training data, poison latent representations, and exploit compounding rollout errors to cause catastrophic failures in safety-critical deployments. World model-equipped agents are more capable of goal misgeneralisation, deceptive alignment, and reward hacking precisely because they can simulate the consequences of their own actions. Authoritative world model predictions further foster automation bias and miscalibrated human trust that operators lack the tools to audit. This paper surveys the world model landscape; introduces formal definitions of trajectory persistence and representational risk; presents a five-profile attacker capability taxonomy; and develops a unified threat model extending MITRE ATLAS and the OWASP LLM Top 10 to the world model stack. We provide an empirical proof-of-concept on trajectory-persistent adversarial attacks (GRU-RSSM: A_1 = 2.26x amplification, -59.5% reduction under adversarial fine-tuning; stochastic RSSM proxy: A_1 = 0.65x; DreamerV3 checkpoint: non-zero action drift confirmed). We illustrate risks through four deployment scenarios and propose interdisciplinary mitigations spanning adversarial hardening, alignment engineering, NIST AI RMF and EU AI Act governance, and human-factors design. We argue that world models must be treated as safety-critical infrastructure requiring the same rigour as flight-control software or medical devices.

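Summary: World models, learned internal simulators of environment dynamics, are rapidly becoming the foundation of autonomous decision-making in robotics, autonomous vehicles, and agentic AI. Yet this predictive power introduces a distinctive set of safety, security, and cognitive risks. Adversaries can corrupt training data, poison latent representations, and exploit compounding rollout errors to cause catastrophic failures in safety-critical deployments. Agents equipped with world models are more prone to goal misgeneralisation, deceptive alignment, and reward hacking, precisely because they can simulate the consequences of their own actions. Authoritative world model predictions further encourage automation bias and miscalibrated human trust, which operators lack the tools to audit.
This paper surveys the world model landscape; introduces formal definitions of trajectory persistence and representational risk; presents a five-profile taxonomy of attacker capabilities; and develops a unified threat model that extends MITRE ATLAS and the OWASP LLM Top 10 to the world model stack. We provide an empirical proof of concept for trajectory-persistent adversarial attacks (GRU-RSSM: A_1 = 2.26x amplification, 59.5% reduction under adversarial fine-tuning; stochastic RSSM proxy: A_1 = 0.65x; DreamerV3 checkpoint: non-zero action drift confirmed). We illustrate the risks through four deployment scenarios and propose interdisciplinary mitigations spanning adversarial hardening, alignment engineering, NIST AI RMF and EU AI Act governance, and human-factors design. We argue that world models must be treated as safety-critical infrastructure that demands the same rigour as flight-control software or medical devices.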

The Recipe Matters More Than the Kitchen: Mathematical Foundations of the AI Weather Prediction Pipeline

2604.01215v1 by Piyush Garg, Diana R. Gergel, Andrew E. Shao, Galen J. Yacalis

AI weather prediction has advanced rapidly, yet no unified mathematical framework explains what determines forecast skill. Existing theory addresses specific architectural choices rather than the learning pipeline as a whole, while operational evidence from 2023-2026 demonstrates that training methodology, loss function design, and data diversity matter at least as much as architecture selection. This paper makes two interleaved contributions. Theoretically, we construct a framework rooted in approximation theory on the sphere, dynamical systems theory, information theory, and statistical learning theory that treats the complete learning pipeline (architecture, loss function, training strategy, data distribution) rather than architecture alone. We establish a Learning Pipeline Error Decomposition showing that estimation error (loss- and data-dependent) dominates approximation error (architecture-dependent) at current scales. We develop a Loss Function Spectral Theory formalizing MSE-induced spectral blurring in spherical harmonic coordinates, and derive Out-of-Distribution Extrapolation Bounds proving that data-driven models systematically underestimate record-breaking extremes with bias growing linearly in record exceedance. Empirically, we validate these predictions via inference across ten architecturally diverse AI weather models using NVIDIA Earth2Studio with ERA5 initial conditions, evaluating six metrics across 30 initialization dates spanning all seasons. Results confirm universal spectral energy loss at high wavenumbers for MSE-trained models, rising Error Consensus Ratios showing that the majority of forecast error is shared across architectures, and linear negative bias during extreme events. A Holistic Model Assessment Score provides unified multi-dimensional evaluation, and a prescriptive framework enables mathematical evaluation of proposed pipelines before training.

The Overlooked Repetitive Lengthening Form in Sentiment Analysis

2604.01268v1 by Lei Wang, Eduard Dragut

Individuals engaging in online communication frequently express personal opinions in informal styles (e.g., memes and emojis). While Language Models (LMs) for informal communication have been widely discussed, a unique and emphatic style, the Repetitive Lengthening Form (RLF), has been overlooked for years. In this paper, we explore answers to two research questions: 1) Is RLF important for sentiment analysis (SA)? 2) Can LMs understand RLF? Inspired by previous linguistic research, we curate Lengthening, the first multi-domain dataset with 850k samples focused on RLF for SA. Moreover, we introduce Explainable Instruction Tuning (ExpInstruct), a two-stage instruction tuning framework that aims to improve both the performance and the explainability of LLMs for RLF. We further propose a novel unified approach to quantify LMs' understanding of informal expressions. We show that RLF sentences are highly expressive and can serve as signatures of document-level sentiment. Additionally, RLF has potential value for online content analysis. Our results show that fine-tuned Pre-trained Language Models (PLMs) can surpass zero-shot GPT-4 in performance but not in explanation for RLF. Finally, we show that ExpInstruct can improve open-source LLMs to match zero-shot GPT-4 in performance and explainability for RLF with limited samples. Code and sample data are available at https://github.com/Tom-Owl/OverlookedRLF

PsychAgent: An Experience-Driven Lifelong Learning Agent for Self-Evolving Psychological Counselor

2604.00931v2 by Yutao Yang, Junsong Li, Qianjun Pan, Jie Zhou, Kai Chen, Qin Chen, Jingyuan Zhao, Ningning Zhou, Xin Li, Liang He

Existing methods for AI psychological counselors predominantly rely on supervised fine-tuning using static dialogue datasets. However, this contrasts with human experts, who continuously refine their proficiency through clinical practice and accumulated experience. To bridge this gap, we propose an Experience-Driven Lifelong Learning Agent (PsychAgent) for psychological counseling. First, we establish a Memory-Augmented Planning Engine tailored for longitudinal multi-session interactions, which ensures therapeutic continuity through persistent memory and strategic planning. Second, to support self-evolution, we design a Skill Evolution Engine that extracts new practice-grounded skills from historical counseling trajectories. Finally, we introduce a Reinforced Internalization Engine that integrates the evolved skills into the model via rejection fine-tuning, aiming to improve performance across diverse scenarios. Comparative analysis shows that our approach achieves higher scores than strong general LLMs (e.g., GPT-5.4, Gemini-3) and domain-specific baselines across all reported evaluation dimensions. These results suggest that lifelong learning can improve the consistency and overall quality of multi-session counseling responses.

Thinking Wrong in Silence: Backdoor Attacks on Continuous Latent Reasoning

2604.00770v1 by Swapnil Parekh

A new generation of language models reasons entirely in continuous hidden states, producing no tokens and leaving no audit trail. We show that this silence creates a fundamentally new attack surface. ThoughtSteer perturbs a single embedding vector at the input layer; the model's own multi-pass reasoning amplifies this perturbation into a hijacked latent trajectory that reliably produces the attacker's chosen answer, while remaining structurally invisible to every token-level defense. Across two architectures (Coconut and SimCoT), three reasoning benchmarks, and model scales from 124M to 3B parameters, ThoughtSteer achieves >=99% attack success rate with near-baseline clean accuracy, transfers to held-out benchmarks without retraining (94-100%), evades all five evaluated active defenses, and survives 25 epochs of clean fine-tuning. We trace these results to a unifying mechanism: Neural Collapse in the latent space pulls triggered representations onto a tight geometric attractor, explaining both why defenses fail and why any effective backdoor must leave a linearly separable signature (probe AUC>=0.999). Yet a striking paradox emerges: individual latent vectors still encode the correct answer even as the model outputs the wrong one. The adversarial information is not in any single vector but in the collective trajectory, establishing backdoor perturbations as a new lens for mechanistic interpretability of continuous reasoning. Code and checkpoints are available.

Towards Initialization-dependent and Non-vacuous Generalization Bounds for Overparameterized Shallow Neural Networks

2604.00505v1 by Yunwen Lei, Yufeng Xie

Overparameterized neural networks often show a benign overfitting property in the sense of achieving excellent generalization behavior despite the number of parameters exceeding the number of training examples. A promising direction to explain benign overfitting is to relate generalization to the norm of the distance from initialization, motivated by the empirical observation that this distance is often significantly smaller than the norm itself. However, the existing initialization-dependent complexity analyses cannot fully exploit the power of initialization, since the associated bounds depend on the spectral norm of the initialization matrix, which can scale as a square-root function of the width; the bounds are therefore not effective for overparameterized models. In this paper, we develop the first fully initialization-dependent complexity bounds for shallow neural networks with general Lipschitz activation functions, which enjoy a logarithmic dependency on the width. Our bounds depend on the path-norm of the distance from initialization and are derived by introducing a new peeling technique to handle the challenge that comes with the initialization-dependent constraint. We also develop a lower bound that is tight up to a constant factor. Finally, we conduct empirical comparisons and show that our generalization analysis implies non-vacuous bounds for overparameterized networks.

A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation

2604.00493v1 by Yabin Zhang, Chong Wang, Yunhe Gao, Jiaming Liu, Maya Varma, Justin Xu, Sophie Ostmeier, Jin Long, Sergios Gatidis, Seena Dehkharghani, Arne Michalson, Eun Kyoung Hong, Christian Bluethgen, Haiwei Henry Guo, Alexander Victor Ortiz, Stephan Altmayer, Sandhya Bodapati, Joseph David Janizek, Ken Chang, Jean-Benoit Delbrouck, Akshay S. Chaudhari, Curtis P. Langlotz

Chest X-rays (CXRs) are among the most frequently performed imaging examinations worldwide, yet rising imaging volumes increase radiologist workload and the risk of diagnostic errors. Although artificial intelligence (AI) systems have shown promise for CXR interpretation, most generate only final predictions, without making explicit how visual evidence is translated into radiographic findings and diagnostic predictions. We present CheXOne, a reasoning-enabled vision-language model for CXR interpretation. CheXOne jointly generates diagnostic predictions and explicit, clinically grounded reasoning traces that connect visual evidence, radiographic findings, and these predictions. The model is trained on 14.7 million instruction and reasoning samples curated from 30 public datasets spanning 36 CXR interpretation tasks, using a two-stage framework that combines instruction tuning with reinforcement learning to improve reasoning quality. We evaluate CheXOne in zero-shot settings across visual question answering, report generation, visual grounding and reasoning assessment, covering 17 evaluation settings. CheXOne outperforms existing medical and general-domain foundation models and achieves strong performance on independent public benchmarks. A clinical reader study demonstrates that CheXOne-drafted reports are comparable to or better than resident-written reports in 55% of cases, while effectively addressing clinical indications and enhancing both report writing and CXR interpretation efficiency. Further analyses involving radiologists reveal that the generated reasoning traces show high clinical factuality and provide causal support for the final predictions, offering a plausible explanation for the performance gains. These results suggest that explicit reasoning can improve model performance, interpretability and clinical utility in AI-assisted CXR interpretation.

Improving Generalization of Deep Learning for Brain Metastases Segmentation Across Institutions

2604.00397v1 by Yuchen Yang, Shuangyang Zhong, Haijun Yu, Langcuomu Suo, Hongbin Han, Florian Putz, Yixing Huang

Background: Deep learning has demonstrated significant potential for automated brain metastases (BM) segmentation; however, models trained at a single institution often exhibit suboptimal performance at other sites due to disparities in scanner hardware, imaging protocols, and patient demographics. The goal of this work is to create a domain adaptation framework that allows BM segmentation to be used across multiple institutions. Methods: We propose a VAE-MMD preprocessing pipeline that combines variational autoencoders (VAE) with maximum mean discrepancy (MMD) loss, incorporating skip connections and self-attention mechanisms alongside nnU-Net segmentation. The method was tested on 740 patients from four public databases (Stanford, UCSF, UCLM, and PKG) and evaluated by domain classifier accuracy, sensitivity, precision, F1/F2 scores, surface Dice (sDice), and 95th percentile Hausdorff distance (HD95). Results: VAE-MMD reduced domain classifier accuracy from 0.91 to 0.50, indicating successful feature alignment across institutions. Reconstructed volumes attained a PSNR greater than 36 dB, maintaining anatomical accuracy. The combined method raised the mean F1 by 11.1% (0.700 to 0.778), the mean sDice by 7.93% (0.7121 to 0.7686), and reduced the mean HD95 by 65.5% (11.33 to 3.91 mm) across all four centers compared to the baseline nnU-Net. Conclusions: VAE-MMD effectively diminishes cross-institutional data heterogeneity and enhances BM segmentation generalization across volumetric, detection, and boundary-level metrics without necessitating target-domain labels, thereby overcoming a significant obstacle to the clinical implementation of AI-assisted segmentation.
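
For readers unfamiliar with the MMD term named in the Methods, the following is a minimal sketch of a (biased) squared-MMD estimate with an RBF kernel. It is not the authors' implementation: the kernel choice, the bandwidth, the function names, and the toy feature vectors are all assumptions made for illustration.

```python
import math

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors (assumed kernel choice)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))

# Hypothetical latent features from two institutions (toy numbers only).
site_a = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0)]
site_b = [(2.0, 2.1), (1.9, 2.2), (2.1, 1.8)]

same = mmd_squared(site_a, site_a)     # identical samples: no discrepancy
shifted = mmd_squared(site_a, site_b)  # shifted samples: positive discrepancy
print(same, shifted)
```

Minimizing a term of this kind between latent codes from different sites pushes their distributions together, which is one way to read the reported drop in domain classifier accuracy toward 0.50.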

Collaborative AI Agents and Critics for Fault Detection and Cause Analysis in Network Telemetry

2604.00319v1 by Syed Eqbal Alam, Zhan Shu

We develop algorithms for collaborative control of AI agents and critics in a multi-actor, multi-critic federated multi-agent system. Each AI agent and critic has access to classical machine learning or generative AI foundation models. The AI agents and critics collaborate with a central server to complete multimodal tasks such as fault detection, severity, and cause analysis in a network telemetry system, text-to-image generation, video generation, and healthcare diagnostics from medical images and patient records. The AI agents complete their tasks and send them to AI critics for evaluation. The critics then send feedback to agents to improve their responses. Collaboratively, they minimize the overall cost to the system with no inter-agent or inter-critic communication. AI agents and critics keep their cost functions or derivatives of cost functions private. Using multi-time scale stochastic approximation techniques, we provide convergence guarantees on the time-average active states of AI agents and critics. The communication overhead on the system is small, of order $\mathcal{O}(m)$ for $m$ modalities, and is independent of the number of AI agents and critics. Finally, we present an example of fault detection, severity, and cause analysis in network telemetry, together with a thorough evaluation of the algorithm's efficacy.

AI-Mediated Explainable Regulation for Justice

2604.00237v1 by Thomas Hofweber, Andreas Sudmann, Evangelos Pournaras

The present practice of deciding on regulation faces numerous problems that make adopted regulations static, unexplained, unduly influenced by powerful interest groups, and stained with a perception of illegitimacy. These well-known problems with the regulatory process can lead to injustice and have substantial negative effects on society and democracy. We discuss a new approach that utilizes distributed artificial intelligence (AI) to make a regulatory recommendation that is explainable and adaptable by design. We outline the main components of a system that can implement this approach and show how it would resolve the problems with the present regulatory system. This approach models and reasons about stakeholder preferences with separate preference models, while aggregating these preferences in a value-sensitive way. Such recommendations can be updated due to changes in facts or in values and are inherently explainable. We suggest how stakeholders can make their preferences known to the system and how they can verify whether they were properly considered in the regulatory decision. The resulting system promises to support regulatory justice, legitimacy, and compliance.

Explainable AI for Blind and Low-Vision Users: Navigating Trust, Modality, and Interpretability in the Agentic Era

2604.00187v1 by Abu Noman Md Sakib, Protik Dey, Zijie Zhang, Taslima Akter

Explainable Artificial Intelligence (XAI) is critical for ensuring trust and accountability, yet its development remains predominantly visual. For blind and low-vision (BLV) users, the lack of accessible explanations creates a fundamental barrier to the independent use of AI-driven assistive technologies. This problem intensifies as AI systems shift from single-query tools into autonomous agents that take multi-step actions and make consequential decisions across extended task horizons, where a single undetected error can propagate irreversibly before any feedback is available. This paper investigates the unique XAI requirements of the BLV community through a comprehensive analysis of user interviews and contemporary research. By examining usage patterns across environmental perception and decision support, we identify a significant modality gap. Empirical evidence suggests that while BLV users highly value conversational explanations, they frequently experience "self-blame" for AI failures. The paper concludes with a research agenda for accessible Explainable AI in agentic systems, advocating for multimodal interfaces, blame-aware explanation design, and participatory development.

Physiological and Semantic Patterns in Medical Teams Using an Intelligent Tutoring System

2603.29950v1 by Xiaoshan Huang, Conrad Borchers, Jiayi Zhang, Susanne P. Lajoie

Effective collaboration requires teams to manage complex cognitive and emotional states through Socially Shared Regulation of Learning (SSRL). Physiological synchrony (i.e., longitudinal alignment in physiological signals) can indicate these states, but is hard to interpret on its own. We investigate the physiological and conversational dynamics of four medical dyads diagnosing a virtual patient case using an intelligent tutoring system. Semantic shifts in dialogue were correlated with transient physiological synchrony peaks. We also coded utterance segments for SSRL and derived cosine similarity using sentence embeddings. The results showed that activating prior knowledge featured significantly lower semantic similarity than simpler task execution. High physiological synchrony was associated with lower semantic similarity, suggesting that such moments involve exploratory and varied language use. Qualitative analysis triangulated these synchrony peaks as "pivotal moments": successful teams synchronized during shared discovery, while unsuccessful teams peaked during shared uncertainty. This research advances human-centered AI by demonstrating how biological signals can be fused with dialogues to understand critical moments in problem solving.
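
The semantic-similarity measure mentioned above can be sketched as follows. This is only an illustration of cosine similarity between two embedding vectors; the three-dimensional vectors are made-up stand-ins for sentence embeddings, and the abstract does not say which embedding model the authors used.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical embeddings of three consecutive utterances (toy values).
utterance_1 = [1.0, 0.0, 0.0]
utterance_2 = [2.0, 0.0, 0.0]   # same direction: semantically "close"
utterance_3 = [0.0, 3.0, 0.0]   # orthogonal direction: a semantic shift

close = cosine_similarity(utterance_1, utterance_2)
shift = cosine_similarity(utterance_2, utterance_3)
print(close, shift)
```

A drop in similarity between adjacent utterances, as between the second and third toy vectors, is the kind of semantic shift that the study relates to synchrony peaks.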

Uncertainty Gating for Cost-Aware Explainable Artificial Intelligence

2603.29915v1 by Georgii Mikriukov, Grégoire Montavon, Marina M. -C. Höhne

Post-hoc explanation methods are widely used to interpret black-box predictions, but their generation is often computationally expensive and their reliability is not guaranteed. We propose epistemic uncertainty as a low-cost proxy for explanation reliability: high epistemic uncertainty identifies regions where the decision boundary is poorly defined and where explanations become unstable and unfaithful. This insight enables two complementary use cases: "improving worst-case explanations" (routing samples to cheap or expensive XAI methods based on expected explanation reliability), and "recalling high-quality explanations" (deferring explanation generation for uncertain samples under constrained budget). Across four tabular datasets, five diverse architectures, and four XAI methods, we observe a strong negative correlation between epistemic uncertainty and explanation stability. Further analysis shows that epistemic uncertainty distinguishes not only stable from unstable explanations, but also faithful from unfaithful ones. Experiments on image classification confirm that our findings generalize beyond tabular data.
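
One plausible reading of the routing use case is sketched below. Everything here is an assumption for illustration: the abstract does not specify how epistemic uncertainty is estimated, so the sketch uses disagreement (variance) across hypothetical ensemble members, and the threshold and method labels are invented.

```python
from statistics import pvariance

def epistemic_uncertainty(member_probs):
    """Disagreement among ensemble members, used here as an assumed uncertainty proxy."""
    return pvariance(member_probs)

def route_explainer(member_probs, threshold=0.01):
    """Send uncertain samples to the expensive explainer, confident ones to the cheap one."""
    if epistemic_uncertainty(member_probs) > threshold:
        return "expensive"
    return "cheap"

# Hypothetical positive-class probabilities from four ensemble members.
confident_sample = [0.91, 0.93, 0.92, 0.90]   # members agree
disputed_sample = [0.15, 0.85, 0.40, 0.70]    # members disagree

print(route_explainer(confident_sample), route_explainer(disputed_sample))
```

The same gate could instead defer explanation generation for the uncertain sample when the budget is constrained, which corresponds to the second use case in the abstract.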

Reasoning-Driven Synthetic Data Generation and Evaluation

2603.29791v1 by Tim R. Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, Hamza Harkous

Although many AI applications of interest require specialized multi-modal models, relevant data to train such models is inherently scarce or inaccessible. Filling these gaps with human annotators is prohibitively expensive, error-prone, and time-consuming, leading model builders to increasingly consider synthetic data as a scalable alternative. However, existing synthetic data generation methods often rely on manual prompts, evolutionary algorithms, or extensive seed data from the target distribution - limiting their scalability, explainability, and control. In this paper, we introduce Simula: a novel reasoning-driven framework for data generation and evaluation. It employs a seedless, agentic approach to generate synthetic datasets at scale, allowing users to define desired dataset characteristics through an explainable and controllable process that enables fine-grained resource allocation. We show the efficacy of our approach on a variety of datasets, rigorously testing both intrinsic and downstream properties. Our work (1) offers guidelines for synthetic data mechanism design, (2) provides insights into generating and evaluating synthetic data at scale, and (3) unlocks new opportunities for developing and deploying AI in domains where data scarcity or privacy concerns are paramount.

CausalPulse: An Industrial-Grade Neurosymbolic Multi-Agent Copilot for Causal Diagnostics in Smart Manufacturing

2603.29755v1 by Chathurangi Shyalika, Utkarshani Jaimini, Cory Henson, Amit Sheth

Modern manufacturing environments demand real-time, trustworthy, and interpretable root-cause insights to sustain productivity and quality. Traditional analytics pipelines often treat anomaly detection, causal inference, and root-cause analysis as isolated stages, limiting scalability and explainability. In this work, we present CausalPulse, an industry-grade multi-agent copilot that automates causal diagnostics in smart manufacturing. It unifies anomaly detection, causal discovery, and reasoning through a neurosymbolic architecture built on standardized agentic protocols. CausalPulse is being deployed in a Robert Bosch manufacturing plant, integrating seamlessly with existing monitoring workflows and supporting real-time operation at production scale. Evaluations on both public (Future Factories) and proprietary (Planar Sensor Element) datasets show high reliability, achieving overall success rates of 98.0% and 98.73%. Per-criterion success rates reached 98.75% for planning and tool use, 97.3% for self-reflection, and 99.2% for collaboration. Runtime experiments report end-to-end latency of 50-60s per diagnostic workflow with near-linear scalability (R^2=0.97), confirming real-time readiness. Comparison with existing industrial copilots highlights distinct advantages in modularity, extensibility, and deployment maturity. These results demonstrate how CausalPulse's modular, human-in-the-loop design enables reliable, interpretable, and production-ready automation for next-generation manufacturing.

Symphony for Medical Coding: A Next-Generation Agentic System for Scalable and Explainable Medical Coding

2603.29709v1 by Joakim Edin, Andreas Motzfeldt, Simon Flachs, Lars Maaløe

Medical coding translates free-text clinical documentation into standardized codes drawn from classification systems that contain tens of thousands of entries and are updated annually. It is central to billing, clinical research, and quality reporting, yet remains largely manual, slow, and error-prone. Existing automated approaches learn to predict a fixed set of codes from labeled data, thereby preventing adaptation to new codes or different coding systems without retraining on different data. They also provide no explanation for their predictions, limiting trust in safety-critical settings. We introduce Symphony for Medical Coding, a system that approaches the task the way expert human coders do: by reasoning over the clinical narrative with direct access to the coding guidelines. This design allows Symphony to operate across any coding system and to provide span-level evidence linking each predicted code to the text that supports it. We evaluate on two public benchmarks and three real-world datasets spanning inpatient, outpatient, emergency, and subspecialty settings across the United States and the United Kingdom. Symphony achieves state-of-the-art results across all settings, establishing itself as a flexible, deployment-ready foundation for automated clinical coding.

Beyond the Steeper Curve: AI-Mediated Metacognitive Decoupling and the Limits of the Dunning-Kruger Metaphor

2603.29681v1 by Christopher Koch

The common claim that generative AI simply amplifies the Dunning-Kruger effect is too coarse to capture the available evidence. The clearest findings instead suggest that large language model (LLM) use can improve observable output and short-term task performance while degrading metacognitive accuracy and flattening the classic competence-confidence gradient across skill groups. This paper synthesizes evidence from human-AI interaction, learning research, and model evaluation, and proposes the working model of AI-mediated metacognitive decoupling: a widening gap among produced output, underlying understanding, calibration accuracy, and self-assessed ability. This four-variable account better explains overconfidence, over- and under-reliance, crutch effects, and weak transfer than the simpler metaphor of a uniformly steeper Dunning-Kruger curve. The paper concludes with implications for tool design, assessment, and knowledge work.

AI-Generated Prior Authorization Letters: Strong Clinical Content, Weak Administrative Scaffolding

2603.29366v1 by Moiz Sadiq Awan, Maryam Raza

Prior authorization remains one of the most burdensome administrative processes in U.S. healthcare, consuming billions of dollars and thousands of physician hours each year. While large language models have shown promise across clinical text tasks, their ability to produce submission-ready prior authorization letters has received only limited attention, with existing work confined to single-case demonstrations rather than structured multi-scenario evaluation. We assessed three commercially available LLMs (GPT-4o, Claude Sonnet 4.5, and Gemini 2.5 Pro) across 45 physician-validated synthetic scenarios spanning rheumatology, psychiatry, oncology, cardiology, and orthopedics. All three models generated letters with strong clinical content: accurate diagnoses, well-structured medical necessity arguments, and thorough step therapy documentation. However, a secondary analysis of real-world administrative requirements revealed consistent gaps that clinical scoring alone did not capture, including absent billing codes, missing authorization duration requests, and inadequate follow-up plans. These findings reframe the question: the challenge for clinical deployment is not whether LLMs can write clinically adequate letters, but whether the systems built around them can supply the administrative precision that payer workflows require.

Rigorous Explanations for Tree Ensembles

2603.29361v1 by Yacine Izza, Alexey Ignatiev, Xuanxiang Huang, Peter J. Stuckey, Joao Marques-Silva

Tree ensembles (TEs) find a multitude of practical applications. They represent one of the most general and accurate classes of machine learning methods. While they are typically quite concise in representation, their operation remains inscrutable to human decision makers. One solution to build trust in the operation of TEs is to automatically identify explanations for the predictions made. Evidently, we can only achieve trust using explanations if those explanations are rigorous, that is, if they truly reflect properties of the underlying predictor they explain. This paper investigates the computation of rigorously-defined, logically-sound explanations for the concrete case of two well-known examples of tree ensembles, namely random forests and boosted trees.

摘要:樹集成(TEs)在許多實際應用中發揮著重要作用。它們代表了機器學習方法中最通用且準確的類別之一。雖然它們的表達通常相當簡潔,但其運作對於人類決策者來說仍然難以理解。建立對樹集成運作信任的一個解決方案是自動識別所做預測的解釋。顯然,只有當這些解釋是嚴謹的,即真正反映了它們所解釋的基礎預測器的特性時,我們才能通過解釋來獲得信任。本文研究了對兩個知名樹集成示例,即隨機森林和提升樹,進行嚴謹定義且邏輯上合理的解釋的計算。

Predicting Neuromodulation Outcome for Parkinson's Disease with Generative Virtual Brain Model

2603.29176v1 by Siyuan Du, Siyi Li, Shuwei Bai, Ang Li, Haolin Li, Mingqing Xiao, Yang Pan, Dongsheng Li, Weidi Xie, Yanfeng Wang, Ya Zhang, Chencheng Zhang, Jiangchao Yao

Parkinson's disease (PD) affects over ten million people worldwide. Although temporal interference (TI) and deep brain stimulation (DBS) are promising therapies, inter-individual variability limits empirical treatment selection, increasing non-negligible surgical risk and cost. Previous explorations either resort to limited statistical biomarkers that are insufficient to characterize variability, or employ AI-driven methods that are prone to overfitting and opacity. We bridge this gap with a pretraining-finetuning framework to predict outcomes directly from resting-state fMRI. Critically, a generative virtual brain foundation model, pretrained on a collective dataset (2707 subjects, 5621 sessions) to capture universal disorder patterns, was finetuned on PD cohorts receiving TI (n=51) or DBS (n=55) to yield individualized virtual brains with high fidelity to empirical functional connectivity (r=0.935). By constructing counterfactual estimations between pathological and healthy neural states within these personalized models, we predicted clinical responses (TI: AUPR=0.853; DBS: AUPR=0.915), substantially outperforming baselines. External and prospective validations (n=14, n=11) highlight the feasibility of clinical translation. Moreover, our framework provides state-dependent regional patterns linked to response, offering hypothesis-generating mechanistic insights.

摘要:帕金森病(PD)影響全球超過一千萬人。儘管時間干擾(TI)和深腦刺激(DBS)是有前景的療法,但個體間的變異性限制了實證治療的選擇,增加了不可忽視的手術風險和成本。先前的探索要麼依賴於有限的統計生物標記,這些標記不足以表徵變異性,要麼使用易於過擬合和不透明的 AI 驅動方法。我們通過一個預訓練-微調框架來填補這一空白,直接從靜息態 fMRI 預測結果。關鍵是,一個生成的虛擬大腦基礎模型,在一個集體數據集(2707 名受試者,5621 次會議)上進行預訓練,以捕捉普遍的疾病模式,並在接受 TI(n=51)或 DBS(n=55)的 PD 群體上進行微調,以產生與實證功能連接高度一致的個性化虛擬大腦(r=0.935)。通過在這些個性化模型中構建病理和健康神經狀態之間的反事實估計,我們預測了臨床反應(TI: AUPR=0.853; DBS: AUPR=0.915),顯著超越基準。外部和前瞻性驗證(n=14, n=11)突顯了臨床轉化的可行性。此外,我們的框架提供了與反應相關的狀態依賴區域模式,提供了生成假設的機制見解。

Knowledge database development by large language models for countermeasures against viruses and marine toxins

2603.29149v1 by Hung N. Do, Jessica Z. Kubicek-Sutherland, S. Gnanakaran

Access to the most up-to-date information on medical countermeasures is important for the research and development of effective treatments for viruses and marine toxins. However, there is a lack of comprehensive databases that curate data on viruses and marine toxins, making decisions on medical countermeasures slow and difficult. In this work, we employ two large language models (LLMs), ChatGPT and Grok, to design two comprehensive databases of therapeutic countermeasures for five viruses (Lassa, Marburg, Ebola, Nipah, and Venezuelan equine encephalitis) as well as marine toxins. With high-level human-provided inputs, the two LLMs identify public databases containing data on the five viruses and marine toxins, collect relevant information from these databases and the literature, iteratively cross-validate the collected information, and design interactive webpages for easy access to the curated, comprehensive databases. Notably, the ChatGPT LLM is employed to design agentic AI workflows (consisting of two AI agents for research and decision-making) to rank countermeasures for viruses and marine toxins in the databases. Together, our work explores the potential of LLMs as a scalable, updatable approach for building comprehensive knowledge databases and supporting evidence-based decision-making.

摘要:獲取有關醫療對策的最新資訊對於病毒和海洋毒素的有效治療的研究與開發至關重要。然而,缺乏綜合數據庫來整理有關病毒和海洋毒素的數據,使得醫療對策的決策變得緩慢且困難。在這項工作中,我們使用兩個大型語言模型(LLMs),即ChatGPT和Grok,設計了兩個針對拉薩病毒、馬爾堡病毒、埃博拉病毒、尼帕病毒和委內瑞拉馬腦炎病毒以及海洋毒素的治療對策的綜合數據庫。在高層次的人類提供的輸入下,這兩個LLMs識別出包含五種病毒和海洋毒素數據的公共數據庫,從這些數據庫和文獻中收集相關信息,迭代交叉驗證所收集的信息,並設計互動網頁以便於訪問整理好的綜合數據庫。值得注意的是,ChatGPT LLM被用來設計代理式AI工作流程(由兩個AI代理組成,分別負責研究和決策),以對數據庫中的病毒和海洋毒素對策進行排名。總之,我們的工作探討了LLMs作為可擴展、可更新的方法來建立綜合知識數據庫並支持基於證據的決策的潛力。

Towards Explainable Stakeholder-Aware Requirements Prioritisation in Aged-Care Digital Health

2603.29114v1 by Yuqing Xiao, John Grundy, Anuradha Madugalla, Elizabeth Manias

Requirements engineering (RE) for aged-care digital health must account for human aspects, because requirement priorities are shaped not only by technical functionality but also by stakeholders' health conditions, socioeconomics, and lived experience. Knowing which human aspects matter most, and for whom, is critical for inclusive and evidence-based requirements prioritisation. Yet in practice, while some studies have examined human aspects in RE, they have largely relied on expert judgement or model-driven analysis rather than large-scale user studies with meaningful human-in-the-loop validation to determine which aspects matter most and why. To address this gap, we conducted a mixed-methods study with 103 older adults, 105 developers, and 41 caregivers. We first applied explainable machine learning to identify the human aspects most strongly associated with requirement priorities across 8 aged-care digital health themes, and then conducted 12 semi-structured interviews to validate and interpret the quantitative patterns. The results identify the key human aspects shaping requirement priorities, reveal their directional effects, and expose substantial misalignment across stakeholder groups. Together, these findings show that human-centric requirements analysis should engage stakeholder groups explicitly rather than collapsing their perspectives into a single aggregate view. This paper contributes an identification of the key human aspects driving requirement priorities in aged-care digital health and an explainable, human-centric RE framework that combines ML-derived importance rankings with qualitative validation to surface the stakeholder misalignments that inclusive requirements engineering must address.

摘要:老年護理數位健康的需求工程必須考慮人類因素,因為需求優先級不僅受到技術功能的影響,還受到利益相關者的健康狀況、社會經濟狀況和生活經驗的影響。了解哪些人類因素最重要,以及對誰最重要,對於包容性和基於證據的需求優先級排序至關重要。然而在實踐中,儘管一些研究已經考察了需求工程中的人類因素,但它們在很大程度上依賴於專家判斷或模型驅動的分析,而非大規模的用戶研究,這些研究具有意義的人類參與驗證,以確定哪些因素最重要及其原因。為了填補這一空白,我們對103位老年人、105位開發者和41位護理人員進行了混合方法研究。我們首先應用可解釋的機器學習來識別與8個老年護理數位健康主題的需求優先級最強相關的人類因素,然後進行了12次半結構化訪談,以驗證和解釋定量模式。結果確定了塑造需求優先級的關鍵人類因素,揭示了它們的方向性影響,並暴露了利益相關者群體之間的重大不一致性。這些發現表明,以人為中心的需求分析應該明確地吸引利益相關者群體,而不是將他們的觀點合併為單一的總體觀。本文貢獻了對推動老年護理數位健康需求優先級的關鍵人類因素的識別,以及一個可解釋的以人為中心的需求工程框架,該框架結合了機器學習導出的重要性排名與定性驗證,以揭示包容性需求工程必須解決的利益相關者不一致性。

A Latent Risk-Aware Machine Learning Approach for Predicting Operational Success in Clinical Trials based on TrialsBank

2603.29041v1 by Iness Halimi, Emmanuel Piffo, Oumnia Boudersa, Yvan Marcel Carre Vilmorin, Melissa Ait-ikhlef, Karima Kone, Andy Tan, Augustin Medina, Juliette Hernando, Sheila Ernest, Vatche Bartekian, Karine Lalonde, Mireille E Schnitzer, Gianolli Dorcelus

Clinical trials are characterized by high costs, extended timelines, and substantial operational risk, yet reliable prospective methods for predicting trial success before initiation remain limited. Existing artificial intelligence approaches often focus on isolated metrics or specific development stages and frequently rely on variables unavailable at the trial design phase, limiting real-world applicability. We present a hierarchical latent risk-aware machine learning framework for prospective prediction of clinical trial operational success using a curated subset of TrialsBank, a proprietary AI-ready database developed by Sorintellis, comprising 13,700 trials. Operational success was defined as the ability to initiate, conduct, and complete a clinical trial according to planned timelines, recruitment targets, and protocol specifications through database lock. This approach decomposes operational success prediction into two modeling stages. First, intermediate latent operational risk factors are predicted using more than 180 drug- and trial-level features available before trial initiation. These predicted latent risks are then integrated into a downstream model to estimate the probability of operational success. A staged data-splitting strategy was employed to prevent information leakage, and models were benchmarked using XGBoost, CatBoost, and Explainable Boosting Machines. Across Phase I-III, the framework achieves strong out-of-sample performance, with F1-scores of 0.93, 0.92, and 0.91, respectively. Incorporating latent risk drivers improves discrimination of operational failures, and performance remains robust under independent inference evaluation. These results demonstrate that clinical trial operational success can be prospectively forecasted using a latent risk-aware AI framework, enabling early risk assessment and supporting data-driven clinical development decision-making.

摘要:臨床試驗的特點是高成本、長時間和相當大的操作風險,然而在試驗啟動之前,可靠的預測試驗成功的前瞻性方法仍然有限。現有的人工智慧方法通常專注於孤立的指標或特定的開發階段,並且經常依賴於在試驗設計階段無法獲得的變數,限制了其在現實世界中的適用性。我們提出了一個層次性潛在風險感知的機器學習框架,用於利用經過整理的TrialsBank子集預測臨床試驗的操作成功,TrialsBank是由Sorintellis開發的專有AI準備數據庫,包含13,700個試驗。操作成功被定義為根據計劃的時間表、招募目標和協議規範,啟動、進行和完成臨床試驗的能力,直到數據庫鎖定。這種方法將操作成功的預測分解為兩個建模階段。首先,使用在試驗啟動之前可用的180多個藥物和試驗級別特徵來預測中間潛在操作風險因素。這些預測的潛在風險然後被整合到下游模型中,以估計操作成功的概率。採用了分階段數據拆分策略以防止信息洩漏,並使用XGBoost、CatBoost和可解釋增強機器進行模型基準測試。在第一至第三階段,該框架在樣本外表現強勁,F1分數分別為0.93、0.92和0.91。納入潛在風險驅動因素提高了對操作失敗的區分能力,並且在獨立推斷評估下性能依然穩健。這些結果表明,臨床試驗的操作成功可以通過潛在風險感知的AI框架進行前瞻性預測,使早期風險評估成為可能,並支持基於數據的臨床開發決策。

Towards a Medical AI Scientist

2603.28589v1 by Hongtao Wu, Boyun Zheng, Dingjie Song, Yu Jiang, Jianfeng Gao, Lei Xing, Lichao Sun, Yixuan Yuan

Autonomous systems that generate scientific hypotheses, conduct experiments, and draft manuscripts have recently emerged as a promising paradigm for accelerating discovery. However, existing AI Scientists remain largely domain-agnostic, limiting their applicability to clinical medicine, where research is required to be grounded in medical evidence with specialized data modalities. In this work, we introduce Medical AI Scientist, the first autonomous research framework tailored to clinical research. It enables clinically grounded ideation by transforming extensively surveyed literature into actionable evidence through a clinician-engineer co-reasoning mechanism, which improves the traceability of generated research ideas. It further facilitates evidence-grounded manuscript drafting guided by structured medical compositional conventions and ethical policies. The framework operates under 3 research modes, namely paper-based reproduction, literature-inspired innovation, and task-driven exploration, each corresponding to a distinct level of automated scientific inquiry with progressively increasing autonomy. Comprehensive evaluations by both large language models and human experts demonstrate that the ideas generated by the Medical AI Scientist are of substantially higher quality than those produced by commercial LLMs across 171 cases, 19 clinical tasks, and 6 data modalities. Meanwhile, our system achieves strong alignment between the proposed method and its implementation, while also demonstrating significantly higher success rates in executable experiments. Double-blind evaluations by human experts and the Stanford Agentic Reviewer suggest that the generated manuscripts approach MICCAI-level quality, while consistently surpassing those from ISBI and BIBM. The proposed Medical AI Scientist highlights the potential of leveraging AI for autonomous scientific discovery in healthcare.

摘要:自主系統生成科學假設、進行實驗並撰寫手稿,最近已成為加速發現的一個有前景的範式。然而,現有的人工智慧科學家在很大程度上仍然是領域無關的,這限制了它們在臨床醫學中的應用,因為研究需要基於醫學證據並具有專門的數據模式。在這項工作中,我們介紹了醫療人工智慧科學家,這是第一個專為臨床自主研究量身定制的自主研究框架。它通過臨床醫生與工程師的共同推理機制,將廣泛調查的文獻轉化為可行的證據,從而實現臨床基礎的創意,並提高生成研究想法的可追溯性。它進一步促進了基於證據的手稿撰寫,並遵循結構化的醫學組合慣例和倫理政策。該框架在三種研究模式下運作,即基於論文的重現、受文獻啟發的創新和任務驅動的探索,每種模式對應於不同層次的自動化科學探究,並逐步增加自主性。大型語言模型和人類專家的全面評估顯示,醫療人工智慧科學家生成的想法在171個案例、19個臨床任務和6種數據模式中,質量顯著高於商業LLM所產生的想法。同時,我們的系統在所提出的方法與其實施之間實現了強大的對齊,並在可執行實驗中顯示出顯著更高的成功率。人類專家和斯坦福代理審稿人的雙盲評估表明,生成的手稿接近MICCAI級別的質量,同時穩定超越來自ISBI和BIBM的手稿。所提出的醫療人工智慧科學家突顯了利用人工智慧進行醫療保健自主科學發現的潛力。

Detecting low left ventricular ejection fraction from ECG using an interpretable and scalable predictor-driven framework

2603.28532v1 by Ya Zhou, Tianxiang Hao, Ziyi Cai, Haojie Zhu, Hejun He, Jia Liu, Xiaohan Fan, Jing Yuan

Low left ventricular ejection fraction (LEF) frequently remains undetected until progression to symptomatic heart failure, underscoring the need for scalable screening strategies. Although artificial intelligence-enabled electrocardiography (AI-ECG) has shown promise, existing approaches rely solely on end-to-end black-box models with limited interpretability or on tabular systems dependent on commercial ECG measurement algorithms with suboptimal performance. We introduced ECG-based Predictor-Driven LEF (ECGPD-LEF), a structured framework that integrates foundation model-derived diagnostic probabilities with interpretable modeling for detecting LEF from ECG. Trained on the benchmark EchoNext dataset comprising 72,475 ECG-echocardiogram pairs and evaluated in predefined independent internal (n=5,442) and external (n=16,017) cohorts, our framework achieved robust discrimination for moderate LEF (internal AUROC 88.4%, F1 64.5%; external AUROC 86.8%, F1 53.6%), consistently outperforming the official end-to-end baseline provided with the benchmark across demographic and clinical subgroups. Interpretability analyses identified high-impact predictors, including normal ECG, incomplete left bundle branch block, and subendocardial injury in anterolateral leads, driving LEF risk estimation. Notably, these predictors independently enabled zero-shot-like inference without task-specific retraining (internal AUROC 75.3-81.0%; external AUROC 71.6-78.6%), indicating that ventricular dysfunction is intrinsically encoded within structured diagnostic probability representations. This framework reconciles predictive performance with mechanistic transparency, supporting scalable enhancement through additional predictors and seamless integration with existing AI-ECG systems.

摘要:低左心室射血分數(LEF)常常在進展到有症狀的心臟衰竭之前未被檢測到,這突顯了可擴展篩查策略的必要性。儘管人工智慧驅動的心電圖(AI-ECG)顯示出潛力,但現有的方法僅依賴於端到端的黑箱模型,這些模型的可解釋性有限,或依賴於依賴商業心電圖測量算法的表格系統,這些算法的性能不佳。我們引入了基於心電圖的預測驅動LEF(ECGPD-LEF),這是一個結構化框架,將基礎模型衍生的診斷概率與可解釋建模相結合,以從心電圖中檢測LEF。該框架在基準EchoNext數據集上進行訓練,該數據集包含72,475對心電圖-超聲心動圖,並在預定的獨立內部(n=5,442)和外部(n=16,017)隊列中進行評估,我們的框架在中度LEF的區分上達到了穩健的表現(內部AUROC 88.4%,F1 64.5%;外部AUROC 86.8%,F1 53.6%),在各個人口和臨床子群中始終優於基準提供的官方端到端基線。可解釋性分析確定了高影響力的預測因子,包括正常心電圖、不完全的左束支傳導阻滯和前外側導聯的心內膜下損傷,這些因子驅動了LEF風險評估。值得注意的是,這些預測因子獨立地實現了類似零樣本的推斷,而無需特定任務的再訓練(內部AUROC 75.3-81.0%;外部AUROC 71.6-78.6%),這表明心室功能障礙本質上在結構化診斷概率表示中被編碼。這一框架調和了預測性能與機制透明度,支持通過額外的預測因子和與現有AI-ECG系統的無縫整合來實現可擴展的增強。
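The AUROC figures quoted in this abstract can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes AUROC that way on made-up data; the function name, labels, and scores are illustrative assumptions and are not taken from the ECGPD-LEF paper or the EchoNext dataset.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = low ejection fraction, 0 = normal.
toy_labels = [1, 1, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.3, 0.1]
print(f"AUROC = {auroc(toy_labels, toy_scores):.3f}")
```

The pairwise form is quadratic in the number of examples, which is fine for illustration; rank-based implementations are preferable for cohorts of the size reported above.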

RAD-LAD: Rule and Language Grounded Autonomous Driving in Real-Time

2603.28522v2 by Anurag Ghosh, Srinivasa Narasimhan, Manmohan Chandraker, Francesco Pittaluga

We present LAD, a real-time language-action planner with an interruptible architecture that produces a motion plan in a single forward pass (~20 Hz) or generates textual reasoning alongside a motion plan (~10 Hz). LAD is fast enough for real-time closed-loop deployment, achieving ~3x lower latency than prior driving language models while setting a new learning-based state of the art on nuPlan Test14-Hard and InterPlan. We also introduce RAD, a rule-based planner designed to address structural limitations of PDM-Closed. RAD achieves state-of-the-art performance among rule-based planners on nuPlan Test14-Hard and InterPlan. Finally, we show that combining RAD and LAD enables hybrid planning that captures the strengths of both approaches. This hybrid system demonstrates that rules and learning provide complementary capabilities: rules support reliable maneuvering, while language enables adaptive and explainable decision-making.

摘要:我們提出了 LAD,一種具備可中斷架構的即時語言—行動規劃器,能在單次前向傳遞中生成運動計畫(約 20 Hz),或在生成運動計畫的同時產生文本推理(約 10 Hz)。LAD 的速度足以用於即時閉環部署,達到比先前的駕駛語言模型低約 3 倍的延遲,同時在 nuPlan Test14-Hard 和 InterPlan 上設立了新的基於學習的最先進技術。我們還介紹了 RAD,一種基於規則的規劃器,旨在解決 PDM-Closed 的結構性限制。RAD 在 nuPlan Test14-Hard 和 InterPlan 上的基於規則的規劃器中達到了最先進的性能。最後,我們展示了結合 RAD 和 LAD 的混合規劃,能夠捕捉兩種方法的優勢。這個混合系統證明了規則和學習提供了互補的能力:規則支持可靠的操控,而語言則使得適應性和可解釋的決策成為可能。

The Unreasonable Effectiveness of Scaling Laws in AI

2603.28507v1 by Chien-Ping Lu

Classical AI scaling laws, especially for pre-training, describe how training loss decreases with compute in a power-law form. Their effectiveness has a basic and very practical sense: they make progress predictable, albeit at a declining rate. Yet their effectiveness is also unreasonable in two further senses. First, these laws are largely empirical and observational, but they appear repeatedly across model families and increasingly across training-adjacent regimes. Second, despite the diminishing returns they predict, progress in practice has often continued through rapidly improving efficiency, visible for example in falling cost per token. This paper argues that both features arise from the same source: scaling laws are unusually effective because they abstract away from many realization details. The compute variable is best understood as logical compute, an implementation-agnostic notion of model-side work, while the practical burden of scaling depends on how efficiently real resources are converted into that compute. This abstraction helps explain both why the laws travel so well across settings and why they give rise to a persistent efficiency game in hardware, algorithms, and systems. Once efficiency is made explicit, the main practical question becomes how many efficiency doublings are required to keep scaling productive despite diminishing returns. Under that view, diminishing returns are not only a geometric flattening of the loss curve, but also rising pressure for cost reduction, system-level innovation, and the breakthroughs needed to sustain Moore-like efficiency doublings.

摘要:古典人工智慧的擴展法則,尤其是在預訓練方面,描述了訓練損失如何以冪律形式隨著計算量的增加而減少。它們的有效性在基本上和非常實際的意義上是明顯的:它們使進展可預測,儘管以下降的速度進行。然而,它們的有效性在另外兩個方面也是不合理的。首先,這些法則在很大程度上是經驗性和觀察性的,但它們在模型家族中反覆出現,並且在與訓練相關的領域中越來越普遍。其次,儘管它們預測了收益遞減,實際上進展往往通過快速提高效率而持續,這在每個標記的成本下降中可見。本文主張這兩個特徵源於同一來源:擴展法則異常有效,因為它們抽象掉了許多實現細節。計算變量最好理解為邏輯計算,這是一種與實現無關的模型側工作概念,而擴展的實際負擔則取決於真實資源轉換為該計算的效率。這種抽象有助於解釋為什麼這些法則能夠在不同環境中如此有效,以及為什麼它們引發了硬體、算法和系統中的持續效率競賽。一旦效率變得明確,主要的實際問題便是需要多少次效率翻倍才能在收益遞減的情況下保持擴展的生產力。從這個角度看,收益遞減不僅是損失曲線的幾何平坦化,還是對成本降低、系統級創新以及維持摩爾式效率翻倍所需突破的上升壓力。
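The abstract's question of how many doublings are needed to keep scaling productive can be made concrete with a small sketch. It assumes a pure power law L(C) = a * C^(-alpha) with no irreducible-loss term; the functional form, the exponent 0.05, and the function names are illustrative assumptions, not values taken from the paper.

```python
import math


def loss(compute: float, a: float = 1.0, alpha: float = 0.05) -> float:
    """Assumed power-law loss L(C) = a * C**(-alpha)."""
    if compute <= 0:
        raise ValueError("compute must be positive")
    return a * compute ** (-alpha)


def doublings_for_loss_ratio(ratio: float, alpha: float = 0.05) -> float:
    """Compute doublings k needed so that L(2**k * C) / L(C) == ratio.

    Under the power law, L(2**k * C) / L(C) = 2**(-alpha * k),
    so k = -log2(ratio) / alpha.
    """
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must lie strictly between 0 and 1")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return -math.log2(ratio) / alpha


for target in (0.9, 0.8, 0.5):
    k = doublings_for_loss_ratio(target)
    print(f"loss x{target}: about {k:.1f} compute doublings")
```

Under these assumptions, each further fixed fractional reduction in loss costs the same number of doublings, so the absolute compute required grows exponentially. If physical resources are held fixed, those doublings have to come from efficiency gains instead, which is the pressure the abstract describes.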

CiQi-Agent: Aligning Vision, Tools and Aesthetics in Multimodal Agent for Cultural Reasoning on Chinese Porcelains

2603.28474v1 by Wenhan Wang, Zhixiang Zhou, Zhongtian Ma, Yanzhu Chen, Ziyu Lin, Hao Sheng, Pengfei Liu, Honglin Ma, Wenqi Shao, Qiaosheng Zhang, Yu Qiao

The connoisseurship of antique Chinese porcelain demands extensive historical expertise, material understanding, and aesthetic sensitivity, making it difficult for non-specialists to engage. To democratize cultural-heritage understanding and assist expert connoisseurship, we introduce CiQi-Agent, a domain-specific Porcelain Connoisseurship Agent for intelligent analysis of antique Chinese porcelain. CiQi-Agent supports multi-image porcelain inputs and enables vision tool invocation and multimodal retrieval-augmented generation, performing fine-grained connoisseurship analysis across six attributes: dynasty, reign period, kiln site, glaze color, decorative motif, and vessel shape. Beyond attribute classification, it captures subtle visual details, retrieves relevant domain knowledge, and integrates visual and textual evidence to produce coherent, explainable connoisseurship descriptions. To achieve this capability, we construct a large-scale, expert-annotated dataset CiQi-VQA, comprising 29,596 porcelain specimens, 51,553 images, and 557,940 visual question-answering pairs, and further establish a comprehensive benchmark CiQi-Bench aligned with the previously mentioned six attributes. CiQi-Agent is trained through supervised fine-tuning, reinforcement learning, and a tool-augmented reasoning framework that integrates two categories of tools: a vision tool and multimodal retrieval tools. Experimental results show that CiQi-Agent (7B) outperforms all competitive open- and closed-source models across all six attributes on CiQi-Bench, achieving on average 12.2% higher accuracy than GPT-5. The model and dataset have been released and are publicly available at https://huggingface.co/datasets/SII-Monument-Valley/CiQi-VQA.

摘要:古董中國瓷器的鑑賞需要廣泛的歷史專業知識、材料理解和美學敏感性,使得非專家難以參與。為了使文化遺產的理解民主化並協助專家鑑賞,我們推出了 CiQi-Agent —— 一個專門針對古董中國瓷器的鑑賞代理,旨在進行智能分析。CiQi-Agent 支持多圖像瓷器輸入,並啟用視覺工具調用和多模態檢索增強生成,對六個屬性進行細緻的鑑賞分析:朝代、統治時期、窯址、釉色、裝飾圖案和器型。除了屬性分類外,它還捕捉微妙的視覺細節,檢索相關的領域知識,並整合視覺和文本證據,以產生連貫且可解釋的鑑賞描述。為了實現這一能力,我們構建了一個大規模的專家標註數據集 CiQi-VQA,包含 29,596 件瓷器標本、51,553 張圖像和 557,940 對視覺問題--回答,並進一步建立了一個與上述六個屬性對齊的綜合基準 CiQi-Bench。CiQi-Agent 通過監督微調、強化學習和一個整合了兩類工具的工具增強推理框架進行訓練:一個視覺工具和多模態檢索工具。實驗結果顯示,CiQi-Agent (7B) 在 CiQi-Bench 上的所有六個屬性上都超越了所有競爭的開源和閉源模型,平均準確率比 GPT-5 高出 12.2%。該模型和數據集已經發布,並可在 https://huggingface.co/datasets/SII-Monument-Valley/CiQi-VQA 上公開獲取。

The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

2603.28387v1 by Doan Nam Long Vu, Simone Balloccu

Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, FOR2107 (affective disorders) and OASIS-3 (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to 58% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely mentioning MRI availability in the task prompt accounts for 70-80% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the scaffold effect. Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.

摘要:值得信賴的臨床人工智慧要求性能提升反映真實的證據整合,而非表面層次的人工產物。我們在兩個臨床神經影像學群體上對12個開放權重的視覺-語言模型(VLMs)進行二元分類評估,FOR2107(情感障礙)和OASIS-3(認知衰退)。這兩個數據集都包含結構性MRI數據,但不帶有可靠的個體級診斷信號。在這些條件下,較小的VLM在引入神經影像學背景後,F1得分提升高達58%,而經過提煉的模型與規模大一個數量級的對應模型競爭。對比信心分析顯示,僅僅在任務提示中提及MRI的可用性就佔據了70-80%的變化,與影像數據是否存在無關,這是一個我們稱之為支架效應的領域特定的模態崩潰實例。專家評估顯示在所有條件下均存在基於神經影像的理由的虛構,而偏好對齊在消除MRI參考行為的同時,使兩個條件都趨向隨機基線。我們的發現表明,表面評估並不足以作為多模態推理的指標,這對於在臨床環境中部署VLM有直接的影響。

Mapping data literacy trajectories in K-12 education

2603.28317v1 by Robert Whyte, Manni Cheung, Katharine Childs, Jane Waite, Sue Sentance

Data literacy skills are fundamental in computer science education. However, understanding how data-driven systems work represents a paradigm shift from traditional rule-based programming. We conducted a systematic literature review of 84 studies to understand K-12 learners' engagement with data across disciplines and contexts. We propose the data paradigms framework that categorises learning activities along two dimensions: (i) logic (knowledge-based or data-driven systems), and (ii) explainability (transparent or opaque models). We further apply the notion of learning trajectories to visualize the pathways learners follow across these distinct paradigms. We detail four distinct trajectories as a provocation for researchers and educators to reflect on how the notion of data literacy varies depending on the learning context. We suggest these trajectories could be useful to those concerned with the design of data literacy learning environments within and beyond CS education.

摘要:數據素養技能在計算機科學教育中是基本的。 然而,理解數據驅動系統的運作代表了從傳統基於規則的編程的範式轉變。 我們進行了一項系統的文獻回顧,分析了84項研究,以了解K-12學習者在不同學科和背景下與數據的互動。 我們提出了數據範式框架,將學習活動沿著兩個維度進行分類:(i)邏輯(基於知識或數據驅動系統),以及(ii)可解釋性(透明或不透明模型)。 我們進一步應用學習軌跡的概念來可視化學習者在這些不同範式之間所遵循的路徑。 我們詳細描述了四條不同的軌跡,以激發研究人員和教育工作者反思數據素養的概念如何根據學習背景而有所不同。 我們建議這些軌跡對於關心數據素養學習環境設計的人士,無論是在計算機科學教育內部還是外部,都可能是有用的。

Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey

2603.27918v1 by Bhavuk Jain, Sercan Ö. Arık, Hardeo K. Thakur

Multimodal large language models (MLLMs) integrate information from multiple modalities such as text, images, audio, and video, enabling complex capabilities such as visual question answering and audio translation. While powerful, this increased expressiveness introduces new and amplified vulnerabilities to adversarial manipulation. This survey provides a comprehensive and systematic analysis of adversarial threats to MLLMs, moving beyond enumerating attack techniques to explain the underlying causes of model susceptibility. We introduce a taxonomy that organizes adversarial attacks according to attacker objectives, unifying diverse attack surfaces across modalities and deployment settings. We also present a vulnerability-centric analysis that links integrity attacks, safety and jailbreak failures, control and instruction hijacking, and training-time poisoning to shared architectural and representational weaknesses in multimodal systems. Together, this framework provides an explanatory foundation for understanding adversarial behavior in MLLMs and informs the development of more robust and secure multimodal language systems.

摘要:多模態大型語言模型(MLLMs)整合來自多種模態的信息,如文本、圖像、音頻和視頻,使得複雜的能力得以實現,例如視覺問答和音頻翻譯。雖然功能強大,但這種增強的表達能力也引入了新的和加劇的對抗性操控脆弱性。這項調查提供了對MLLMs對抗性威脅的全面和系統性分析,超越了僅僅列舉攻擊技術,解釋模型易受攻擊的根本原因。我們引入了一種分類法,根據攻擊者的目標組織對抗性攻擊,統一了不同模態和部署設置中的多樣攻擊面。此外,我們還提出了一種以脆弱性為中心的分析,將完整性攻擊、安全性和越獄失敗、控制和指令劫持以及訓練時間的毒化連結到多模態系統中共享的架構和表徵弱點。總體而言,這一框架為理解MLLMs中的對抗行為提供了解釋基礎,並為開發更強大和安全的多模態語言系統提供了指導。

ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks

2603.27862v1 by Samin Mahdizadeh Sani, Max Ku, Nima Jamali, Matina Mahdizadeh Sani, Paria Khoshtab, Wei-Chieh Sun, Parnian Fazel, Zhi Rui Tam, Thomas Chong, Edisy Kin Wai Chan, Donald Wai Tong Tsang, Chiao-Wei Hsu, Ting Wai Lam, Ho Yin Sam Ng, Chiafeng Chu, Chak-Wing Mak, Keming Wu, Hiu Tung Wong, Yik Chun Ho, Chi Ruan, Zhuofeng Li, I-Sheng Fang, Shih-Ying Yeh, Ho Kei Cheng, Ping Nie, Wenhu Chen

Advances in diffusion, autoregressive, and hybrid models have enabled high-quality image synthesis for tasks such as text-to-image, editing, and reference-guided composition. Yet existing benchmarks remain limited: they either focus on isolated tasks, cover only narrow domains, or provide opaque scores without explaining failure modes. We introduce ImagenWorld, a benchmark of 3.6K condition sets spanning six core tasks (generation and editing, with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). The benchmark is supported by 20K fine-grained human annotations and an explainable evaluation schema that tags localized object-level and segment-level errors, complementing automated VLM-based metrics. Our large-scale evaluation of 14 models yields several insights: (1) Models typically struggle more in editing tasks than in generation tasks, especially in local edits. (2) Models excel in artistic and photorealistic settings but struggle with symbolic and text-heavy domains such as screenshots and information graphics. (3) Closed-source systems lead overall, while targeted data curation (e.g., Qwen-Image) narrows the gap in text-heavy cases. (4) Modern VLM-based metrics achieve Kendall accuracies up to 0.79, approximating human ranking, but fall short of fine-grained, explainable error attribution. ImagenWorld provides both a rigorous benchmark and a diagnostic tool to advance robust image generation.

摘要:進展於擴散、自回歸和混合模型使得高品質圖像合成成為可能,應用於文本到圖像、編輯和參考引導的構圖等任務。 然而,現有的基準仍然有限,或專注於孤立任務,或僅涵蓋狹窄領域,或提供不透明的分數而未解釋失敗模式。 我們介紹ImagenWorld,這是一個包含3.6K條件集的基準,涵蓋六個核心任務(生成和編輯,單一或多重參考)和六個主題領域(藝術作品、逼真圖像、信息圖形、文本圖形、計算機圖形和截圖)。 該基準得到了20K細緻的人類註釋和一個可解釋的評估架構的支持,該架構標記了局部物體級和段落級錯誤,補充了自動化的VLM基準指標。 我們對14個模型的大規模評估產生了幾個見解:(1)模型在編輯任務中通常比在生成任務中更具挑戰性,特別是在局部編輯方面。(2)模型在藝術和逼真設置中表現優異,但在符號和文本密集的領域(如截圖和信息圖形)中表現不佳。(3)封閉源系統整體領先,而針對性的數據策劃(例如Qwen-Image)在文本密集的情況下縮小了差距。(4)現代VLM基準指標達到高達0.79的Kendall準確度,接近人類排名,但在細緻的可解釋錯誤歸因方面仍然不足。 ImagenWorld提供了一個嚴謹的基準和診斷工具,以推進穩健的圖像生成。

What-If Explanations Over Time: Counterfactuals for Time Series Classification

2603.27792v1 by Udo Schlegel, Thomas Seidl

Counterfactual explanations have emerged as a powerful approach in explainable AI, providing what-if scenarios that reveal how minimal changes to an input time series can alter the model's prediction. This work presents a survey of recent algorithms for counterfactual explanations for time series classification. We review state-of-the-art methods, spanning instance-based nearest-neighbor techniques, pattern-driven algorithms, gradient-based optimization, and generative models. For each, we discuss the underlying methodology, the models and classifiers they target, and the datasets on which they are evaluated. We highlight unique challenges in generating counterfactuals for temporal data, such as maintaining temporal coherence, plausibility, and actionable interpretability, which distinguish the temporal from tabular or image domains. We analyze the strengths and limitations of existing approaches and compare their effectiveness along key dimensions (validity, proximity, sparsity, plausibility, etc.). In addition, we provide an open-source library, Counterfactual Explanations for Time Series (CFTS), as a reference framework that includes many algorithms and evaluation metrics. We discuss this library's contributions in standardizing evaluation and enabling practical adoption of explainable time series techniques. Finally, based on the literature and identified gaps, we propose future research directions, including improved user-centered design, integration of domain knowledge, and counterfactuals for time series forecasting.

摘要:反事實解釋作為可解釋人工智慧中的一種強大方法,提供了假設情境,揭示了對輸入時間序列的最小變更如何改變模型的預測。這項工作呈現了最近針對時間序列分類的反事實解釋算法的調查。我們回顧了最先進的方法,涵蓋了基於實例的最近鄰技術、模式驅動算法、基於梯度的優化和生成模型。對於每一種方法,我們討論了其基本方法論、目標模型和分類器,以及其評估所用的數據集。我們突出了生成時間數據反事實的獨特挑戰,例如保持時間一致性、合理性和可操作的可解釋性,這些挑戰使得時間數據與表格或圖像領域有所區別。我們分析了現有方法的優勢和局限性,並在關鍵維度(有效性、接近性、稀疏性、合理性等)上比較了它們的有效性。此外,我們實現了一個開源實現庫,時間序列的反事實解釋(CFTS),作為一個參考框架,包含許多算法和評估指標。我們討論了這個庫在標準化評估和促進可解釋時間序列技術的實際採用方面的貢獻。最後,根據文獻和識別的空白,我們提出了未來的研究方向,包括改進以用戶為中心的設計、整合領域知識以及時間序列預測的反事實。
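Among the instance-based nearest-neighbor techniques the survey mentions, the simplest idea is to return the closest known series that the model assigns to a different class. The sketch below illustrates that idea only. It is not the CFTS library's API: the function names, the 1-nearest-neighbor stand-in classifier, the Euclidean distance, and the toy series are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_1nn(series, train):
    """Stand-in classifier: label of the closest training series."""
    return min(train, key=lambda item: euclidean(series, item[0]))[1]


def nearest_unlike_neighbor(query, train, predict):
    """Closest training series whose predicted label differs from the query's.

    Returns None when every training series shares the query's label.
    """
    original = predict(query)
    candidates = [s for s, _ in train if predict(s) != original]
    if not candidates:
        return None
    return min(candidates, key=lambda s: euclidean(query, s))


# Hypothetical toy data: (series, label) pairs.
train = [
    ([0.0, 0.0, 0.0, 0.0], "flat"),
    ([0.0, 1.0, 0.0, 1.0], "spiky"),
    ([0.0, 2.0, 0.0, 2.0], "spiky"),
]
query = [0.0, 0.2, 0.0, 0.1]


def model(series):
    return predict_1nn(series, train)


counterfactual = nearest_unlike_neighbor(query, train, model)
print("original label:", model(query))
print("counterfactual:", counterfactual, "->", model(counterfactual))
print("proximity (L2):", round(euclidean(query, counterfactual), 3))
```

A counterfactual found this way is valid by construction and plausible in the sense that it is a real observed series, but it is usually neither sparse nor minimal, which is one reason the surveyed methods go further.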

TianJi: An autonomous AI meteorologist for discovering physical mechanisms in atmospheric science

2603.27738v1 by Kaikai Zhang, Xiang Wang, Haoluo Zhao, Nan Chen, Mengyang Yu, Jing-Jia Luo, Tao Song, Fan Meng

Artificial intelligence (AI) has achieved breakthroughs comparable to traditional numerical models in data-driven weather forecasting, yet it remains essentially statistical fitting and struggles to uncover the physical causal mechanisms of the atmosphere. Physics-oriented mechanism research still heavily relies on domain knowledge and cumbersome engineering operations of human scientists, becoming a bottleneck restricting the efficiency of Earth system science exploration. Here, we propose TianJi - the first "AI meteorologist" system capable of autonomously driving complex numerical models to verify physical mechanisms. Powered by a large language model-driven multi-agent architecture, TianJi can autonomously conduct literature research and generate scientific hypotheses. We further decouple scientific research into cognitive planning and engineering execution: the meta-planner interprets hypotheses and devises experimental roadmaps, while a cohort of specialized worker agents collaboratively complete data preparation, model configuration, and multi-dimensional result analysis. In two classic atmospheric dynamic scenarios (squall-line cold pools and typhoon track deflections), TianJi accomplishes expert-level end-to-end experimental operations with zero human intervention, compressing the research cycle to a few hours. It also delivers detailed result analyses and autonomously judges and explains the validity of the hypotheses from outputs. TianJi reveals that the role of AI in Earth system science is transitioning from a "black-box predictor" to an "interpretable scientific collaborator", offering a new paradigm for high-throughput exploration of scientific mechanisms.

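Summary: Artificial intelligence (AI) has achieved breakthroughs in data-driven weather forecasting that rival traditional numerical models, yet it remains essentially statistical fitting and struggles to uncover the physical causal mechanisms of the atmosphere. Physics-oriented mechanism research still relies heavily on domain knowledge and on cumbersome engineering work by human scientists, which has become a bottleneck for Earth system science. Here we propose TianJi, the first "AI meteorologist" system able to drive complex numerical models autonomously in order to verify physical mechanisms. Built on a multi-agent architecture driven by a large language model, TianJi can conduct literature research and generate scientific hypotheses on its own. We further decouple scientific research into cognitive planning and engineering execution: a meta-planner interprets hypotheses and designs experimental roadmaps, while a group of specialized worker agents collaborates on data preparation, model configuration, and multi-dimensional result analysis. In two classic atmospheric dynamics scenarios (squall-line cold pools and typhoon track deflections), TianJi completes expert-level end-to-end experimental operations without human intervention, compressing the research cycle to a few hours. It also provides detailed result analyses and autonomously judges and explains the validity of the hypotheses from the outputs. TianJi shows that the role of AI in Earth system science is shifting from a "black-box predictor" to an "interpretable scientific collaborator", offering a new paradigm for high-throughput exploration of scientific mechanisms.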

Multi-Agent Dialectical Refinement for Enhanced Argument Classification

2603.27451v1 by Jakub Bąba, Jarosław A. Chudziak

Argument Mining (AM) is a foundational technology for automated writing evaluation, yet traditional supervised approaches rely heavily on expensive, domain-specific fine-tuning. While Large Language Models (LLMs) offer a training-free alternative, they often struggle with structural ambiguity, failing to distinguish between similar components like Claims and Premises. Furthermore, single-agent self-correction mechanisms often suffer from sycophancy, where the model reinforces its own initial errors rather than critically evaluating them. We introduce MAD-ACC (Multi-Agent Debate for Argument Component Classification), a framework that leverages dialectical refinement to resolve classification uncertainty. MAD-ACC utilizes a Proponent-Opponent-Judge model where agents defend conflicting interpretations of ambiguous text, exposing logical nuances that single-agent models miss. Evaluation on the UKP Student Essays corpus demonstrates that MAD-ACC achieves a Macro F1 score of 85.7%, significantly outperforming single-agent reasoning baselines, without requiring domain-specific training. Additionally, unlike "black-box" classifiers, MAD-ACC's dialectical approach offers a transparent and explainable alternative by generating human-readable debate transcripts that explain the reasoning behind decisions.

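Summary: Argument Mining (AM) is a foundational technology for automated writing evaluation, yet traditional supervised approaches depend heavily on expensive, domain-specific fine-tuning. Large Language Models (LLMs) offer a training-free alternative, but they often struggle with structural ambiguity and fail to distinguish similar components such as Claims and Premises. Single-agent self-correction also tends to suffer from sycophancy: the model reinforces its own initial errors instead of evaluating them critically. We introduce MAD-ACC (Multi-Agent Debate for Argument Component Classification), a framework that uses dialectical refinement to resolve classification uncertainty. MAD-ACC adopts a Proponent-Opponent-Judge model in which agents defend conflicting interpretations of ambiguous text, exposing logical nuances that single-agent models miss. Evaluation on the UKP Student Essays corpus shows that MAD-ACC reaches a Macro F1 score of 85.7%, significantly outperforming single-agent reasoning baselines without any domain-specific training. Unlike "black-box" classifiers, MAD-ACC's dialectical approach also offers a transparent, explainable alternative by generating human-readable debate transcripts that explain the reasoning behind each decision.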
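
For readers unfamiliar with the headline metric, the sketch below shows how a Macro F1 score is computed: per-class F1 values are averaged with equal weight, so a rare class counts as much as a frequent one. The labels and predictions are toy values invented for illustration, not data from the UKP Student Essays corpus.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example: one Claim is misclassified as a Premise.
gold = ["Claim", "Claim", "Premise", "Premise"]
pred = ["Claim", "Premise", "Premise", "Premise"]
print(round(macro_f1(gold, pred), 4))
```

Here the Claim class scores 2/3 and the Premise class scores 4/5, so the macro average is about 0.733.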

Culturally Adaptive Explainable LLM Assessment for Multilingual Information Disorder: A Human-in-the-Loop Approach

2603.27356v1 by Maziar Kianimoghadam Jouneghani

Recognizing information disorder is difficult because judgments about manipulation depend on cultural and linguistic context. Yet current Large Language Models (LLMs) often behave as monocultural, English-centric "black boxes," producing fluent rationales that overlook localized framing. Preliminary evidence from the multilingual Information Disorder (InDor) corpus suggests that existing models struggle to explain manipulated news consistently across communities. To address this gap, this ongoing study proposes a Hybrid Intelligence Loop, a human-in-the-loop (HITL) framework that grounds model assessment in human-written rationales from native-speaking annotators. The approach moves beyond static target-language few-shot prompting by pairing English task instructions with dynamically retrieved target-language exemplars drawn from filtered InDor annotations through In-Context Learning (ICL). In the initial pilot, the Exemplar Bank is seeded from these filtered annotations and used to compare static and adaptive prompting on Farsi and Italian news. The study evaluates span and severity prediction, the quality and cultural appropriateness of generated rationales, and model alignment across evaluator groups, providing a testbed for culturally grounded explainable AI.

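Summary: Recognizing information disorder is difficult because judgments about manipulation depend on cultural and linguistic context. Yet current Large Language Models (LLMs) often behave as monocultural, English-centric "black boxes" that produce fluent rationales while overlooking localized framing. Preliminary evidence from the multilingual Information Disorder (InDor) corpus suggests that existing models struggle to explain manipulated news consistently across communities. To address this gap, this study proposes a Hybrid Intelligence Loop, a human-in-the-loop (HITL) framework that grounds model assessment in rationales written by native-speaking annotators. The approach goes beyond static target-language few-shot prompting: through In-Context Learning (ICL), it pairs English task instructions with target-language exemplars retrieved dynamically from filtered InDor annotations. In the initial pilot, the Exemplar Bank is seeded from these filtered annotations and used to compare static and adaptive prompting on Farsi and Italian news. The study evaluates span and severity prediction, the quality and cultural appropriateness of the generated rationales, and model alignment across evaluator groups, providing a testbed for culturally grounded explainable AI.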

Improving Automated Wound Assessment Using Joint Boundary Segmentation and Multi-Class Classification Models

2603.27325v1 by Mehedi Hasan Tusar, Fateme Fayyazbakhsh, Igor Melnychuk, Ming C. Leu

Accurate wound classification and boundary segmentation are essential for guiding clinical decisions in both chronic and acute wound management. However, most existing AI models are limited, focusing on a narrow set of wound types or performing only a single task (segmentation or classification), which reduces their clinical applicability. This study presents a deep learning model based on YOLOv11 that simultaneously performs wound boundary segmentation (WBS) and wound classification (WC) across five clinically relevant wound types: burn injury (BI), pressure injury (PI), diabetic foot ulcer (DFU), vascular ulcer (VU), and surgical wound (SW). A wound-type balanced dataset of 2,963 annotated images was created to train the models for both tasks, with stratified five-fold cross-validation ensuring robust and unbiased evaluation. The models trained on the original non-augmented dataset achieved consistent performance across folds, though BI detection accuracy was relatively lower. Therefore, the dataset was augmented using rotation, flipping, and variations in brightness, saturation, and exposure to help the model learn more generalized and invariant features. This augmentation significantly improved model performance, particularly in detecting visually subtle BI cases. Among tested variants, YOLOv11x achieved the highest performance with F1-scores of 0.9341 (WBS) and 0.8736 (WC), while the lightweight YOLOv11n provided comparable accuracy at lower computational cost, making it suitable for resource-constrained deployments. Supported by confusion matrices and visual detection outputs, the results confirm the model's robustness against complex backgrounds and high intra-class variability, demonstrating the potential of YOLOv11-based architectures for accurate, real-time wound analysis in both clinical and remote care settings.

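Summary: Accurate wound classification and boundary segmentation are essential for guiding clinical decisions in both chronic and acute wound management. Most existing AI models, however, are limited: they focus on a narrow set of wound types or perform only a single task (segmentation or classification), which reduces their clinical applicability. This study presents a YOLOv11-based deep learning model that simultaneously performs wound boundary segmentation (WBS) and wound classification (WC) across five clinically relevant wound types: burn injury (BI), pressure injury (PI), diabetic foot ulcer (DFU), vascular ulcer (VU), and surgical wound (SW). To train the models for both tasks, a wound-type balanced dataset of 2,963 annotated images was created, with stratified five-fold cross-validation ensuring robust and unbiased evaluation. Models trained on the original, non-augmented dataset performed consistently across folds, although BI detection accuracy was relatively low. The dataset was therefore augmented with rotation, flipping, and variations in brightness, saturation, and exposure to help the model learn more general and invariant features. The augmentation significantly improved performance, particularly on visually subtle BI cases. Among the tested variants, YOLOv11x achieved the highest performance, with F1-scores of 0.9341 (WBS) and 0.8736 (WC), while the lightweight YOLOv11n delivered comparable accuracy at lower computational cost, making it suitable for resource-constrained deployments. Supported by confusion matrices and visual detection outputs, the results confirm the model's robustness to complex backgrounds and high intra-class variability, demonstrating the potential of YOLOv11-based architectures for accurate, real-time wound analysis in clinical and remote care settings.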
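
The stratified five-fold protocol mentioned in the abstract can be sketched as follows. This is a generic illustration, not the authors' pipeline: the function name, the seed, and the toy labels are assumptions, and a real study would more likely use an existing library routine.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Return k lists of sample indices with each class spread evenly across folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Toy dataset: 10 images for each of the five wound types.
labels = ["BI", "PI", "DFU", "VU", "SW"] * 10
folds = stratified_folds(labels, k=5)

for i, fold in enumerate(folds):
    held_out = sorted(labels[j] for j in fold)
    print(f"fold {i}: {len(fold)} validation samples, classes {set(held_out)}")
```

Each fold serves once as the validation set while the other four are used for training, so every image is evaluated exactly once and every fold preserves the class balance.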

MediHive: A Decentralized Agent Collective for Medical Reasoning

2603.27150v1 by Xiaoyang Wang, Christopher C. Yang

Large language models (LLMs) have revolutionized medical reasoning tasks, yet single-agent systems often falter on complex, interdisciplinary problems requiring robust handling of uncertainty and conflicting evidence. Multi-agent systems (MAS) leveraging LLMs enable collaborative intelligence, but prevailing centralized architectures suffer from scalability bottlenecks, single points of failure, and role confusion in resource-constrained environments. Decentralized MAS (D-MAS) promise enhanced autonomy and resilience via peer-to-peer interactions, but their application to high-stakes healthcare domains remains underexplored. We introduce MediHive, a novel decentralized multi-agent framework for medical question answering that integrates a shared memory pool with iterative fusion mechanisms. MediHive deploys LLM-based agents that autonomously self-assign specialized roles, conduct initial analyses, detect divergences through conditional evidence-based debates, and locally fuse peer insights over multiple rounds to achieve consensus. Empirically, MediHive outperforms single-LLM and centralized baselines on MedQA and PubMedQA datasets, attaining accuracies of 84.3% and 78.4%, respectively. Our work advances scalable, fault-tolerant D-MAS for medical AI, addressing key limitations of centralized designs while demonstrating superior performance in reasoning-intensive tasks.

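Summary: Large language models (LLMs) have revolutionized medical reasoning tasks, but single-agent systems often falter on complex, interdisciplinary problems that require robust handling of uncertainty and conflicting evidence. Multi-agent systems (MAS) built on LLMs enable collaborative intelligence, yet the prevailing centralized architectures suffer from scalability bottlenecks, single points of failure, and role confusion in resource-constrained environments. Decentralized MAS (D-MAS) promise greater autonomy and resilience through peer-to-peer interaction, but their application to high-stakes healthcare domains remains underexplored. We introduce MediHive, a novel decentralized multi-agent framework for medical question answering that integrates a shared memory pool with iterative fusion mechanisms. MediHive deploys LLM-based agents that autonomously self-assign specialized roles, conduct initial analyses, detect divergences through conditional evidence-based debates, and locally fuse peer insights over multiple rounds to reach consensus. Empirically, MediHive outperforms single-LLM and centralized baselines on the MedQA and PubMedQA datasets, reaching accuracies of 84.3% and 78.4%, respectively. Our work advances scalable, fault-tolerant D-MAS for medical AI, addressing key limitations of centralized designs while showing superior performance on reasoning-intensive tasks.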

Debiasing Large Language Models toward Social Factors in Online Behavior Analytics through Prompt Knowledge Tuning

2603.27057v1 by Hossein Salemi, Jitin Krishnan, Hemant Purohit

Attribution theory explains how individuals interpret and attribute others' behavior in a social context by employing personal (dispositional) and impersonal (situational) causality. Large Language Models (LLMs), trained on human-generated corpora, may implicitly mimic this social attribution process in social contexts. However, the extent to which LLMs utilize these causal attributions in their reasoning remains underexplored. Although using reasoning paradigms, such as Chain-of-Thought (CoT), has shown promising results in various tasks, ignoring social attribution in reasoning could lead to biased responses by LLMs in social contexts. In this study, we investigate the impact of incorporating a user's goal as knowledge to infer dispositional causality and message context to infer situational causality on LLM performance. To this end, we introduce a scalable method to mitigate such biases by enriching the instruction prompts for LLMs with two prompt aids using social-attribution knowledge, based on the context and goal of a social media message. This method improves the model performance while reducing the social-attribution bias of the LLM in the reasoning on zero-shot classification tasks for behavior analytics applications. We empirically show the benefits of our method across two tasks (intent detection and theme detection on social media in the disaster domain) when considering the variability of disaster types and multiple languages of social media. Our experiments highlight the biases of three open-source LLMs: Llama3, Mistral, and Gemma, toward social attribution, and show the effectiveness of our mitigation strategies.

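Summary: Attribution theory explains how individuals interpret and attribute others' behavior in a social context by drawing on personal (dispositional) and impersonal (situational) causality. Large Language Models (LLMs), trained on human-generated corpora, may implicitly mimic this social attribution process in social contexts. However, the extent to which LLMs use these causal attributions in their reasoning remains underexplored. Although reasoning paradigms such as Chain-of-Thought (CoT) have shown promising results on various tasks, ignoring social attribution during reasoning could lead LLMs to produce biased responses in social contexts. In this study, we investigate how LLM performance is affected by using a user's goal as knowledge for inferring dispositional causality and the message context for inferring situational causality. To this end, we introduce a scalable method that mitigates these biases by enriching the instruction prompts for LLMs with social-attribution knowledge, based on the context and goal of a social media message. The method improves model performance on zero-shot classification tasks for behavior analytics applications while reducing the LLM's social-attribution bias in reasoning. We empirically show the benefits of our method on two tasks, intent detection and theme detection on social media in the disaster domain, taking into account the variability of disaster types and the multiple languages of social media. Our experiments highlight the social-attribution biases of three open-source LLMs (Llama3, Mistral, and Gemma) and show the effectiveness of our mitigation strategies.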

PRISMA: Toward a Normative Information Infrastructure for Responsible Pharmaceutical Knowledge Management

2603.26324v1 by Eugenio Rodrigo Zimmer Neves, Amanda Vanon Correa, Camila Campioni, Gabielli Pare Guglielmi, Bruno Morelli

Most existing approaches to AI in pharmacy collapse three epistemologically distinct operations into a single technical layer: document preservation, semantic interpretation, and contextual presentation. This conflation is a root cause of recurring fragilities including loss of provenance, interpretive opacity, alert fatigue, and erosion of accountability. This paper proposes the PATOS--Lector--PRISMA (PLP) infrastructure as a normative information architecture for responsible pharmaceutical knowledge management. PATOS preserves regulatory documents with explicit versioning and provenance; Lector implements machine-assisted reading with human curation, producing typed assertions anchored to primary sources; PRISMA delivers contextual presentation through the RPDA framework (Regulatory, Prescription, Dispensing, Administration), refracting the same informational core into distinct professional views. The architecture introduces the Evidence Pack as a formal unit of accountable assertion (versioned, traceable, epistemically bounded, and curatorially validated), with assertions typified by illocutionary force. A worked example traces dipyrone monohydrate across all three layers using real system data. Developed and validated in Brazil's regulatory context, the architecture is grounded in an operational implementation comprising over 16,000 official documents and 38 curated Evidence Packs spanning five reference medications. The proposal is demonstrated as complementary to operational decision support systems, providing infrastructural conditions that current systems lack: documentary anchoring, interpretive transparency, and institutional accountability.

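Summary: Most existing approaches to AI in pharmacy collapse three epistemologically distinct operations into a single technical layer: document preservation, semantic interpretation, and contextual presentation. This conflation is a root cause of recurring fragilities, including loss of provenance, interpretive opacity, alert fatigue, and erosion of accountability. This paper proposes the PATOS--Lector--PRISMA (PLP) infrastructure as a normative information architecture for responsible pharmaceutical knowledge management. PATOS preserves regulatory documents with explicit versioning and provenance; Lector implements machine-assisted reading with human curation, producing typed assertions anchored to primary sources; PRISMA delivers contextual presentation through the RPDA framework (Regulatory, Prescription, Dispensing, Administration), refracting the same informational core into distinct professional views. The architecture introduces the Evidence Pack as a formal unit of accountable assertion (versioned, traceable, epistemically bounded, and curatorially validated), with assertions typified by illocutionary force. A worked example traces dipyrone monohydrate across all three layers using real system data. Developed and validated in Brazil's regulatory context, the architecture is grounded in an operational implementation comprising over 16,000 official documents and 38 curated Evidence Packs spanning five reference medications. The proposal is shown to complement operational decision support systems by providing infrastructural conditions that current systems lack: documentary anchoring, interpretive transparency, and institutional accountability.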

Sparse Auto-Encoders and Holism about Large Language Models

2603.26207v1 by Jumbly Grindrod

Does Large Language Model (LLM) technology suggest a meta-semantic picture i.e. a picture of how words and complex expressions come to have the meaning that they do? One modest approach explores the assumptions that seem to be built into how LLMs capture the meanings of linguistic expressions as a way of considering their plausibility (Grindrod, 2026a, 2026b). It has previously been argued that LLMs, in employing a form of distributional semantics, adopt a form of holism about meaning (Grindrod, 2023; Grindrod et al., forthcoming). However, recent work in mechanistic interpretability presents a challenge to these arguments. Specifically, the discovery of a vast array of interpretable latent features within the high dimensional spaces used by LLMs potentially challenges the holistic interpretation. In this paper, I will present the original reasons for thinking that LLMs embody a form of holism (section 1), before introducing recent work on features generated through sparse auto-encoders, and explaining how the discovery of such features suggests an alternative decompositional picture of meaning (section 2). I will then respond to this challenge by considering in greater detail the nature of such features (section 3). Finally, I will return to the holistic picture defended by Grindrod et al. and argue that the picture still stands provided that the features are countable (section 4).

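Summary: Does Large Language Model (LLM) technology suggest a meta-semantic picture, that is, a picture of how words and complex expressions come to have the meanings they do? One modest approach examines the assumptions that seem to be built into how LLMs capture the meanings of linguistic expressions, as a way of assessing their plausibility (Grindrod, 2026a, 2026b). It has previously been argued that LLMs, in employing a form of distributional semantics, adopt a form of holism about meaning (Grindrod, 2023; Grindrod et al., forthcoming). Recent work in mechanistic interpretability, however, challenges these arguments. Specifically, the discovery of a vast array of interpretable latent features within the high-dimensional spaces used by LLMs potentially challenges the holistic interpretation. In this paper, I present the original reasons for thinking that LLMs embody a form of holism (section 1), then introduce recent work on features generated through sparse auto-encoders and explain how the discovery of such features suggests an alternative, decompositional picture of meaning (section 2). I then respond to this challenge by considering the nature of such features in greater detail (section 3). Finally, I return to the holistic picture defended by Grindrod et al. and argue that it still stands provided that the features are countable (section 4).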

Concerning Uncertainty -- A Systematic Survey of Uncertainty-Aware XAI

2603.26838v1 by Helena Löfström, Tuwe Löfström, Anders Hjort, Fatima Rabia Yapicioglu

This paper surveys uncertainty-aware explainable artificial intelligence (UAXAI), examining how uncertainty is incorporated into explanatory pipelines and how such methods are evaluated. Across the literature, three recurring approaches to uncertainty quantification emerge (Bayesian, Monte Carlo, and Conformal methods), alongside distinct strategies for integrating uncertainty into explanations: assessing trustworthiness, constraining models or explanations, and explicitly communicating uncertainty. Evaluation practices remain fragmented and largely model centered, with limited attention to users and inconsistent reporting of reliability properties (e.g., calibration, coverage, explanation stability). Recent work leans towards calibration, distribution free techniques and recognizes explainer variability as a central concern. We argue that progress in UAXAI requires unified evaluation principles that link uncertainty propagation, robustness, and human decision-making, and highlight counterfactual and calibration approaches as promising avenues for aligning interpretability with reliability.

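Summary: This paper surveys uncertainty-aware explainable artificial intelligence (UAXAI), examining how uncertainty is incorporated into explanatory pipelines and how such methods are evaluated. Across the literature, three recurring approaches to uncertainty quantification emerge (Bayesian, Monte Carlo, and Conformal methods), alongside distinct strategies for integrating uncertainty into explanations: assessing trustworthiness, constraining models or explanations, and explicitly communicating uncertainty. Evaluation practices remain fragmented and largely model-centered, with limited attention to users and inconsistent reporting of reliability properties (e.g., calibration, coverage, explanation stability). Recent work leans toward calibration and distribution-free techniques and treats explainer variability as a central concern. We argue that progress in UAXAI requires unified evaluation principles that link uncertainty propagation, robustness, and human decision-making, and we highlight counterfactual and calibration approaches as promising avenues for aligning interpretability with reliability.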
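
Of the three uncertainty quantification families listed above, the conformal one is the easiest to show in a few lines. The sketch below is a generic split conformal classifier, not a method taken from the survey: the score (one minus the probability of the true label), the calibration values, and the class names are all assumptions chosen for illustration.

```python
import math

def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of the calibration scores."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(cal_scores)[rank - 1]

def prediction_set(class_probs, q_hat):
    """Keep every label whose nonconformity score (1 - p) is within the threshold."""
    return {label for label, p in class_probs.items() if 1 - p <= q_hat}

# Toy calibration scores: 1 - probability the model gave to the true label.
cal_scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45]
q_hat = conformal_quantile(cal_scores, alpha=0.25)

# Toy test-time class probabilities from the same (hypothetical) model.
test_probs = {"cat": 0.70, "dog": 0.25, "fox": 0.05}
print(q_hat, prediction_set(test_probs, q_hat))
```

Under the usual exchangeability assumption, sets built this way contain the true label with probability at least 1 - alpha, which is the coverage property the survey lists among the reliability properties that are inconsistently reported.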

SkinGPT-X: A Self-Evolving Collaborative Multi-Agent System for Transparent and Trustworthy Dermatological Diagnosis

2603.26122v1 by Zhangtianyi Chen, Yuhao Shen, Florensia Widjaja, Yan Xu, Liyuan Sun, Zijian Wang, Hongyi Chen, Wufei Dai, Juexiao Zhou

While recent advancements in Large Language Models have significantly advanced dermatological diagnosis, monolithic LLMs frequently struggle with fine-grained, large-scale multi-class diagnostic tasks and rare skin disease diagnosis owing to training data sparsity, while also lacking the interpretability and traceability essential for clinical reasoning. Although multi-agent systems can offer more transparent and explainable diagnostics, existing frameworks are primarily concentrated on Visual Question Answering and conversational tasks, and their heavy reliance on static knowledge bases restricts adaptability in complex real-world clinical settings. Here, we present SkinGPT-X, a multimodal collaborative multi-agent system for dermatological diagnosis integrated with a self-evolving dermatological memory mechanism. By simulating the diagnostic workflow of dermatologists and enabling continuous memory evolution, SkinGPT-X delivers transparent and trustworthy diagnostics for the management of complex and rare dermatological cases. To validate the robustness of SkinGPT-X, we design a three-tier comparative experiment. First, we benchmark SkinGPT-X against four state-of-the-art LLMs across four public datasets, demonstrating its state-of-the-art performance with a +9.6% accuracy improvement on DDI31 and +13% weighted F1 gain on Dermnet over the state-of-the-art model. Second, we construct a large-scale multi-class dataset covering 498 distinct dermatological categories to evaluate its fine-grained classification capabilities. Finally, we curate the rare skin disease dataset, the first benchmark to address the scarcity of clinical rare skin diseases, which contains 564 clinical samples with eight rare dermatological diseases. On this dataset, SkinGPT-X achieves a +9.8% accuracy improvement, a +7.1% weighted F1 improvement, and a +10% Cohen's Kappa improvement.

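Summary: Although recent progress in Large Language Models has significantly advanced dermatological diagnosis, monolithic LLMs frequently struggle with fine-grained, large-scale multi-class diagnostic tasks and with rare skin disease diagnosis, mainly because of sparse training data, and they lack the interpretability and traceability that clinical reasoning requires. Multi-agent systems can offer more transparent and explainable diagnostics, but existing frameworks concentrate on Visual Question Answering and conversational tasks, and their heavy reliance on static knowledge bases limits adaptability in complex real-world clinical settings. Here we present SkinGPT-X, a multimodal collaborative multi-agent system for dermatological diagnosis that integrates a self-evolving dermatological memory mechanism. By simulating the diagnostic workflow of dermatologists and enabling continuous memory evolution, SkinGPT-X delivers transparent and trustworthy diagnostics for managing complex and rare dermatological cases. To validate the robustness of SkinGPT-X, we design a three-tier comparative experiment. First, we benchmark SkinGPT-X against four state-of-the-art LLMs on four public datasets, showing a +9.6% accuracy improvement on DDI31 and a +13% weighted F1 gain on Dermnet. Second, we construct a large-scale multi-class dataset covering 498 distinct dermatological categories to evaluate fine-grained classification. Finally, we curate a rare skin disease dataset, the first benchmark to address the scarcity of clinical rare skin diseases, containing 564 clinical samples across eight rare dermatological diseases. On this dataset, SkinGPT-X achieves a +9.8% accuracy improvement, a +7.1% weighted F1 improvement, and a +10% Cohen's Kappa improvement.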

DPD-Cancer: Explainable Graph-based Deep Learning for Small Molecule Anti-Cancer Activity Prediction

2603.26114v1 by Magnus H. Strømme, Alex G. C. de Sá, David B. Ascher

Accurate drug response prediction is a critical bottleneck in computational biochemistry, limited by the challenge of modelling the interplay between molecular structure and cellular context. In cancer research, this is acute due to tumour heterogeneity and genomic variability, which hinder the identification of effective therapies. Conventional approaches often fail to capture non-linear relationships between chemical features and biological outcomes across diverse cell lines. To address this, we introduce DPD-Cancer, a deep learning method based on a Graph Attention Transformer (GAT) framework. It is designed for small molecule anti-cancer activity classification and the quantitative prediction of cell-line specific responses, specifically growth inhibition concentration (pGI50). Benchmarked against state-of-the-art methods (pdCSM-cancer, ACLPred, and MLASM), DPD-Cancer demonstrated superior performance, achieving an Area Under ROC Curve (AUC) of up to 0.87 on strictly partitioned NCI60 data and up to 0.98 on ACLPred/MLASM datasets. For pGI50 prediction across 10 cancer types and 73 cell lines, the model achieved Pearson's correlation coefficients of up to 0.72 on independent test sets. These findings confirm that attention-based mechanisms offer significant advantages in extracting meaningful molecular representations, establishing DPD-Cancer as a competitive tool for prioritising drug candidates. Furthermore, DPD-Cancer provides explainability by leveraging the attention mechanism to identify and visualise specific molecular substructures, offering actionable insights for lead optimisation. DPD-Cancer is freely available as a web server at: https://biosig.lab.uq.edu.au/dpd_cancer/.

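Summary: Accurate drug response prediction is a critical bottleneck in computational biochemistry, limited by the difficulty of modelling the interplay between molecular structure and cellular context. The problem is especially acute in cancer research, where tumour heterogeneity and genomic variability hinder the identification of effective therapies. Conventional approaches often fail to capture the non-linear relationships between chemical features and biological outcomes across diverse cell lines. To address this, we introduce DPD-Cancer, a deep learning method based on a Graph Attention Transformer (GAT) framework. It is designed for small molecule anti-cancer activity classification and for quantitative prediction of cell-line specific responses, specifically growth inhibition concentration (pGI50). Benchmarked against state-of-the-art methods (pdCSM-cancer, ACLPred, and MLASM), DPD-Cancer showed superior performance, reaching an Area Under ROC Curve (AUC) of up to 0.87 on strictly partitioned NCI60 data and up to 0.98 on the ACLPred/MLASM datasets. For pGI50 prediction across 10 cancer types and 73 cell lines, the model reached Pearson's correlation coefficients of up to 0.72 on independent test sets. These findings confirm that attention-based mechanisms offer significant advantages in extracting meaningful molecular representations and establish DPD-Cancer as a competitive tool for prioritising drug candidates. DPD-Cancer also provides explainability by using the attention mechanism to identify and visualise specific molecular substructures, offering actionable insights for lead optimisation. DPD-Cancer is freely available as a web server at: https://biosig.lab.uq.edu.au/dpd_cancer/.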
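
Two quantities in this abstract lend themselves to a short worked example: pGI50, conventionally the negative base-10 logarithm of the GI50 concentration in mol/L, and Pearson's correlation coefficient between measured and predicted values. The sketch below uses invented toy numbers and is not taken from the DPD-Cancer code base.

```python
import math

def pgi50(gi50_molar):
    """pGI50 = -log10(GI50), with GI50 expressed in mol/L."""
    return -math.log10(gi50_molar)

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy GI50 measurements in mol/L (1 micromolar, 100 nanomolar, 10 micromolar).
measured = [pgi50(c) for c in (1e-6, 1e-7, 1e-5)]
# Hypothetical model predictions for the same three compounds.
predicted = [5.8, 6.9, 5.3]

print([round(v, 2) for v in measured], round(pearson_r(measured, predicted), 3))
```

A more potent compound has a lower GI50 and therefore a higher pGI50, which is why the log scale is the usual regression target.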

A Regression Framework for Understanding Prompt Component Impact on LLM Performance

2603.26830v1 by Andrew Lauziere, Jonathan Daugherty, Taisa Kushner

As large language models (LLMs) continue to improve and see further integration into software systems, so does the need to understand the conditions in which they will perform. We contribute a statistical framework for understanding the impact of specific prompt features on LLM performance. The approach extends previous explainable artificial intelligence (XAI) methods specifically to inspect LLMs by fitting regression models relating portions of the prompt to LLM evaluation. We apply our method to compare how two open-source models, Mistral-7B and GPT-OSS-20B, leverage the prompt to perform a simple arithmetic problem. Regression models of individual prompt portions explain 72% and 77% of variation in model performances, respectively. We find misinformation in the form of incorrect example query-answer pairs impedes both models from solving the arithmetic query, though positive examples do not find significant variability in the impact of positive and negative instructions - these prompts have contradictory effects on model performance. The framework serves as a tool for decision makers in critical scenarios to gain granular insight into how the prompt influences an LLM to solve a task.

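Summary: As large language models (LLMs) continue to improve and are integrated further into software systems, the need to understand the conditions under which they perform grows as well. We contribute a statistical framework for understanding the impact of specific prompt features on LLM performance. The approach extends earlier explainable artificial intelligence (XAI) methods to inspect LLMs specifically, by fitting regression models that relate portions of the prompt to LLM evaluation. We apply the method to compare how two open-source models, Mistral-7B and GPT-OSS-20B, use the prompt to solve a simple arithmetic problem. Regression models over individual prompt portions explain 72% and 77% of the variation in model performance, respectively. We find that misinformation in the form of incorrect example query-answer pairs impedes both models from solving the arithmetic query, although positive examples did not show significant variation in the impact of positive and negative instructions; these prompts have contradictory effects on model performance. The framework serves as a tool for decision makers in critical scenarios, giving granular insight into how the prompt influences an LLM's ability to solve a task.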
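
The core idea, regressing a performance score on indicators of which prompt components were present, can be sketched with ordinary least squares. Everything concrete here is an assumption made for illustration: the two component names, the four toy accuracy values, and the choice of a plain linear model (the abstract does not specify the regression form).

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(rows, y):
    """Fit y on binary prompt-component indicators plus an intercept; return (beta, R^2)."""
    X = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations: (X^T X) beta = X^T y
    preds = [sum(b * v for b, v in zip(beta, r)) for r in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, preds))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return beta, 1 - ss_res / ss_tot

# Columns (hypothetical): [has_correct_example, has_incorrect_example].
prompts = [(0, 0), (1, 0), (0, 1), (1, 1)]
accuracy = [0.50, 0.60, 0.20, 0.34]  # toy values, not results from the paper

beta, r_squared = fit_ols(prompts, accuracy)
print([round(b, 3) for b in beta], round(r_squared, 4))
```

In this toy fit the coefficient on the incorrect-example indicator is negative, mirroring the qualitative finding that misleading examples hurt, and R² plays the role of the "variation explained" figures quoted in the abstract.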

FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants

2603.26008v1 by Mahesh Bhosale, Abdul Wasi, Shantam Srivastava, Shifa Latif, Tianyu Luan, Mingchen Gao, David Doermann, Xuan Gong

While powerful in image-conditioned generation, multimodal large language models (MLLMs) can display uneven performance across demographic groups, highlighting fairness risks. In safety-critical clinical settings, such disparities risk producing unequal diagnostic narratives and eroding trust in AI-assisted decision-making. While fairness has been studied extensively in vision-only and language-only models, its impact on MLLMs remains largely underexplored. To address these biases, we introduce FairLLaVA, a parameter-efficient fine-tuning method that mitigates group disparities in visual instruction tuning without compromising overall performance. By minimizing the mutual information between target attributes, FairLLaVA regularizes the model's representations to be demographic-invariant. The method can be incorporated as a lightweight plug-in, maintaining efficiency with low-rank adapter fine-tuning, and provides an architecture-agnostic approach to fair visual instruction following. Extensive experiments on large-scale chest radiology report generation and dermoscopy visual question answering benchmarks show that FairLLaVA consistently reduces inter-group disparities while improving both equity-scaled clinical performance and natural language generation quality across diverse medical imaging modalities. Code can be accessed at https://github.com/bhosalems/FairLLaVA.

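Summary: Although powerful at image-conditioned generation, multimodal large language models (MLLMs) can perform unevenly across demographic groups, which highlights fairness risks. In safety-critical clinical settings, such disparities risk producing unequal diagnostic narratives and eroding trust in AI-assisted decision-making. Fairness has been studied extensively in vision-only and language-only models, but its impact on MLLMs remains largely unexplored. To address these biases, we introduce FairLLaVA, a parameter-efficient fine-tuning method that mitigates group disparities in visual instruction tuning without compromising overall performance. By minimizing the mutual information between target attributes, FairLLaVA regularizes the model's representations to be demographic-invariant. The method can be incorporated as a lightweight plug-in, retains the efficiency of low-rank adapter fine-tuning, and provides an architecture-agnostic approach to fair visual instruction following. Extensive experiments on large-scale chest radiology report generation and dermoscopy visual question answering benchmarks show that FairLLaVA consistently reduces inter-group disparities while improving both equity-scaled clinical performance and natural language generation quality across diverse medical imaging modalities. Code is available at https://github.com/bhosalems/FairLLaVA.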
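
To make the regularization target concrete, the sketch below computes mutual information for paired discrete samples using the plug-in (empirical frequency) estimate. It only illustrates the quantity itself. FairLLaVA's actual objective is not described in the abstract and, since it is trained by gradient descent, presumably relies on a differentiable estimator over continuous representations; the variable names and toy data are assumptions.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x, count_y = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) p(y)) simplifies to count * n / (count_x * count_y).
        mi += p_xy * math.log2(count * n / (count_x[x] * count_y[y]))
    return mi

group = ["A", "A", "B", "B"]   # toy demographic attribute
leaky = [0, 0, 1, 1]           # a feature that fully reveals the group
invariant = [0, 1, 0, 1]       # a feature that carries no group information

print(mutual_information(group, leaky), mutual_information(group, invariant))
```

A representation is demographic-invariant in this sense when its mutual information with the group attribute is close to zero, as in the second case.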

Do Neurons Dream of Primitive Operators? Wake-Sleep Compression Rediscovers Schank's Event Semantics

2603.25975v1 by Peter Balogh

We show that they do. Schank's conceptual dependency theory proposed that all events decompose into primitive operations -- ATRANS, PTRANS, MTRANS, and others -- hand-coded from linguistic intuition. Can the same primitives be discovered automatically through compression pressure alone? We adapt DreamCoder's wake-sleep library learning to event state transformations. Given events as before/after world state pairs, our system finds operator compositions explaining each event (wake), then extracts recurring patterns as new operators optimized under Minimum Description Length (sleep). Starting from four generic primitives, it discovers operators mapping directly to Schank's: MOVE_PROP_has = ATRANS, CHANGE_location = PTRANS, SET_knows = MTRANS, SET_consumed = INGEST, plus compound operators ("mail" = ATRANS + PTRANS) and novel emotional state operators absent from Schank's taxonomy. We validate on synthetic events and real-world commonsense data from the ATOMIC knowledge graph. On synthetic data, discovered operators achieve Bayesian MDL within 4% of Schank's hand-coded primitives while explaining 100% of events vs. Schank's 81%. On ATOMIC, results are more dramatic: Schank's primitives explain only 10% of naturalistic events, while the discovered library explains 100%. Dominant operators are not physical-action primitives but mental and emotional state changes -- CHANGE_wants (20%), CHANGE_feels (18%), CHANGE_is (18%) -- none in Schank's original taxonomy. These results provide the first empirical evidence that event primitives can be derived from compression pressure, that Schank's core primitives are information-theoretically justified, and that the complete inventory is substantially richer than proposed -- with mental/emotional operators dominating in naturalistic data.

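Summary: We show that they do. Schank's conceptual dependency theory proposed that all events decompose into primitive operations (ATRANS, PTRANS, MTRANS, and others) that were hand-coded from linguistic intuition. Can the same primitives be discovered automatically through compression pressure alone?
We adapt DreamCoder's wake-sleep library learning to event state transformations. Given events as before/after world state pairs, our system finds operator compositions that explain each event (wake), then extracts recurring patterns as new operators optimized under Minimum Description Length (sleep). Starting from four generic primitives, it discovers operators that map directly onto Schank's: MOVE_PROP_has = ATRANS, CHANGE_location = PTRANS, SET_knows = MTRANS, SET_consumed = INGEST, plus compound operators ("mail" = ATRANS + PTRANS) and novel emotional state operators absent from Schank's taxonomy.
We validate on synthetic events and on real-world commonsense data from the ATOMIC knowledge graph. On synthetic data, the discovered operators achieve Bayesian MDL within 4% of Schank's hand-coded primitives while explaining 100% of events, versus 81% for Schank's. On ATOMIC, the results are more dramatic: Schank's primitives explain only 10% of naturalistic events, while the discovered library explains 100%. The dominant operators are not physical-action primitives but mental and emotional state changes (CHANGE_wants at 20%, CHANGE_feels at 18%, CHANGE_is at 18%), none of which appear in Schank's original taxonomy.
These results provide the first empirical evidence that event primitives can be derived from compression pressure, that Schank's core primitives are information-theoretically justified, and that the complete inventory is substantially richer than proposed, with mental and emotional operators dominating in naturalistic data.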
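
The "wake" step described above, finding an operator composition that turns the before-state into the after-state, can be illustrated with a toy check. The state encoding (a dict keyed by entity and property), the two operator implementations, and the example event are assumptions made for this sketch; the paper's actual representation and search procedure are not given in the abstract.

```python
def change_location(state, entity, place):
    """Toy CHANGE_location operator (the abstract maps it to Schank's PTRANS)."""
    new_state = dict(state)
    new_state[(entity, "location")] = place
    return new_state

def move_prop_has(state, item, giver, receiver):
    """Toy MOVE_PROP_has operator (the abstract maps it to Schank's ATRANS)."""
    new_state = dict(state)
    new_state[(giver, "has")] = state[(giver, "has")] - {item}
    new_state[(receiver, "has")] = state[(receiver, "has")] | {item}
    return new_state

def explains(program, before, after):
    """True when applying the operators in order turns `before` into `after`."""
    state = before
    for operator, args in program:
        state = operator(state, *args)
    return state == after

# Hypothetical "mail" event: possession and location both change.
before = {
    ("alice", "has"): frozenset({"letter"}),
    ("bob", "has"): frozenset(),
    ("letter", "location"): "alice_home",
}
after = {
    ("alice", "has"): frozenset(),
    ("bob", "has"): frozenset({"letter"}),
    ("letter", "location"): "bob_home",
}

atrans_only = [(move_prop_has, ("letter", "alice", "bob"))]
mail = atrans_only + [(change_location, ("letter", "bob_home"))]
print(explains(atrans_only, before, after), explains(mail, before, after))
```

The single transfer operator leaves the location unexplained, while the two-step composition accounts for the whole event, which is the sense in which "mail" = ATRANS + PTRANS in the abstract. A sleep phase would then consider promoting such a recurring composition to a named operator if doing so shortens the total description length.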

Methods for Knowledge Graph Construction from Text Collections: Development and Applications

2603.25862v1 by Vanni Zavarella

Virtually every sector of society is experiencing a dramatic growth in the volume of unstructured textual data that is generated and published, from news and social media online interactions, through open access scholarly communications and observational data in the form of digital health records and online drug reviews. The volume and variety of data across all this range of domains has created both unprecedented opportunities and pressing challenges for extracting actionable knowledge for several application scenarios. However, the extraction of rich semantic knowledge demands the deployment of scalable and flexible automatic methods adaptable across text genres and schema specifications. Moreover, the full potential of these data can only be unlocked by coupling information extraction methods with Semantic Web techniques for the construction of full-fledged Knowledge Graphs, that are semantically transparent, explainable by design and interoperable. In this thesis, we experiment with the application of Natural Language Processing, Machine Learning and Generative AI methods, powered by Semantic Web best practices, to the automatic construction of Knowledge Graphs from large text corpora, in three use case applications: the analysis of the Digital Transformation discourse in the global news and social media platforms; the mapping and trend analysis of recent research in the Architecture, Engineering, Construction and Operations domain from a large corpus of publications; the generation of causal relation graphs of biomedical entities from electronic health records and patient-authored drug reviews. The contributions of this thesis to the research community are in terms of benchmark evaluation results, the design of customized algorithms and the creation of data resources in the form of Knowledge Graphs, together with data analysis results built on top of them.

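Summary: Virtually every sector of society is seeing dramatic growth in the volume of unstructured textual data that is generated and published, from news and online social media interactions to open access scholarly communications and observational data in the form of digital health records and online drug reviews. The volume and variety of data across these domains create both unprecedented opportunities and pressing challenges for extracting actionable knowledge for a range of application scenarios. Extracting rich semantic knowledge, however, demands scalable and flexible automatic methods that adapt across text genres and schema specifications. Moreover, the full potential of these data can only be unlocked by coupling information extraction methods with Semantic Web techniques to build full-fledged Knowledge Graphs that are semantically transparent, explainable by design, and interoperable. In this thesis, we experiment with applying Natural Language Processing, Machine Learning, and Generative AI methods, guided by Semantic Web best practices, to the automatic construction of Knowledge Graphs from large text corpora in three use cases: analysis of the Digital Transformation discourse in global news and social media platforms; mapping and trend analysis of recent research in the Architecture, Engineering, Construction and Operations domain from a large corpus of publications; and generation of causal relation graphs of biomedical entities from electronic health records and patient-authored drug reviews. The thesis contributes benchmark evaluation results, the design of customized algorithms, and data resources in the form of Knowledge Graphs, together with the data analysis results built on top of them.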

A Compression Perspective on Simplicity Bias

2603.25839v1 by Tom Marty, Eric Elmoznino, Leo Gagnon, Tejas Kasetty, Mizu Nishikawa-Toomey, Sarthak Mittal, Guillaume Lajoie, Dhanya Sridhar

Deep neural networks exhibit a simplicity bias, a well-documented tendency to favor simple functions over complex ones. In this work, we cast new light on this phenomenon through the lens of the Minimum Description Length principle, formalizing supervised learning as a problem of optimal two-part lossless compression. Our theory explains how simplicity bias governs feature selection in neural networks through a fundamental trade-off between model complexity (the cost of describing the hypothesis) and predictive power (the cost of describing the data). Our framework predicts that as the amount of available training data increases, learners transition through qualitatively different features -- from simple spurious shortcuts to complex features -- only when the reduction in data encoding cost justifies the increased model complexity. Consequently, we identify distinct data regimes where increasing data promotes robustness by ruling out trivial shortcuts, and conversely, regimes where limiting data can act as a form of complexity-based regularization, preventing the learning of unreliable complex environmental cues. We validate our theory on a semi-synthetic benchmark showing that the feature selection of neural networks follows the same trajectory of solutions as optimal two-part compressors.

摘要:深度神經網絡表現出簡單性偏見,這是一種已被充分記錄的趨勢,傾向於偏好簡單函數而非複雜函數。在這項工作中,我們通過最小描述長度原則的視角,為這一現象提供了新的見解,將監督學習形式化為最佳雙部分無損壓縮的問題。我們的理論解釋了簡單性偏見如何通過模型複雜性(描述假設的成本)和預測能力(描述數據的成本)之間的基本權衡來支配神經網絡中的特徵選擇。我們的框架預測,隨著可用訓練數據量的增加,學習者會經歷質量上不同的特徵轉變——從簡單的虛假捷徑到複雜特徵——僅當數據編碼成本的降低足以證明增加的模型複雜性是合理的。因此,我們確定了不同的數據範疇,在這些範疇中,增加數據促進了穩健性,排除了微不足道的捷徑;相反,限制數據可以作為一種基於複雜性的正則化形式,防止學習不可靠的複雜環境線索。我們在一個半合成基準上驗證了我們的理論,顯示神經網絡的特徵選擇遵循與最佳雙部分壓縮器相同的解決方案軌跡。
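
The trade-off described in this abstract can be illustrated with a toy two-part code: total description length is the cost of the model plus the cost of the data given the model. The sketch below is only an illustration under assumed numbers. The `Hypothesis` class, the bit costs, and the names `shortcut` and `robust` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    name: str
    model_bits: float        # cost of describing the hypothesis
    bits_per_example: float  # average cost of encoding one label given it


def two_part_cost(h: Hypothesis, n: int) -> float:
    """Total two-part code length for n training examples."""
    return h.model_bits + n * h.bits_per_example


def best_hypothesis(hypotheses, n: int) -> Hypothesis:
    """Pick the hypothesis with the shortest total description."""
    return min(hypotheses, key=lambda h: two_part_cost(h, n))


def crossover_n(simple: Hypothesis, complex_: Hypothesis) -> float:
    """Dataset size at which the complex hypothesis starts to pay off."""
    return (complex_.model_bits - simple.model_bits) / (
        simple.bits_per_example - complex_.bits_per_example
    )


# Hypothetical numbers: a cheap shortcut that predicts poorly versus an
# expensive feature that predicts well.
shortcut = Hypothesis("shortcut", model_bits=10, bits_per_example=0.5)
robust = Hypothesis("robust", model_bits=1000, bits_per_example=0.1)

small_data_choice = best_hypothesis([shortcut, robust], n=1000).name
large_data_choice = best_hypothesis([shortcut, robust], n=5000).name
threshold = crossover_n(shortcut, robust)
```

With these assumed costs the shortcut wins at 1,000 examples and the complex feature wins at 5,000, which mirrors the data-dependent transition the abstract describes.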

Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI

2603.25821v1 by Anna Kozlova, Stanislau Salavei, Pavel Satalkin, Hanna Plotnitskaya, Sergey Parfenyuk

We present Doctorina MedBench, a comprehensive evaluation framework for agent-based medical AI based on the simulation of realistic physician-patient interactions. Unlike traditional medical benchmarks that rely on solving standardized test questions, the proposed approach models a multi-step clinical dialogue in which either a physician or an AI system must collect medical history, analyze attached materials (including laboratory reports, images, and medical documents), formulate differential diagnoses, and provide personalized recommendations. System performance is evaluated using the D.O.T.S. metric, which consists of four components: Diagnosis, Observations/Investigations, Treatment, and Step Count, enabling assessment of both clinical correctness and dialogue efficiency. The system also incorporates a multi-level testing and quality monitoring architecture designed to detect model degradation during both development and deployment. The framework supports safety-oriented trap cases, category-based random sampling of clinical scenarios, and full regression testing. The dataset currently contains more than 1,000 clinical cases covering over 750 diagnoses. The universality of the evaluation metrics allows the framework to be used not only to assess medical AI systems, but also to evaluate physicians and support the development of clinical reasoning skills. Our results suggest that simulation of clinical dialogue may provide a more realistic assessment of clinical competence compared to traditional examination-style benchmarks.

摘要:我們提出了 Doctorina MedBench,這是一個基於模擬現實醫生-病人互動的代理醫療 AI 的綜合評估框架。與依賴解決標準化測試問題的傳統醫療基準不同,所提出的方法模型化了一個多步驟的臨床對話,在這個對話中,醫生或 AI 系統必須收集病歷、分析附加材料(包括實驗室報告、影像和醫療文件)、制定鑑別診斷並提供個性化建議。系統性能使用 D.O.T.S. 指標進行評估,該指標由四個組成部分構成:診斷、觀察/調查、治療和步驟計數,能夠評估臨床正確性和對話效率。

該系統還包含一個多層次的測試和質量監控架構,旨在在開發和部署期間檢測模型退化。該框架支持以安全為導向的陷阱案例、基於類別的臨床場景隨機抽樣以及全面的回歸測試。數據集目前包含超過 1,000 個臨床案例,涵蓋超過 750 種診斷。評估指標的普遍性使得該框架不僅可以用來評估醫療 AI 系統,還可以評估醫生並支持臨床推理技能的發展。我們的結果表明,臨床對話的模擬可能提供比傳統考試風格基準更現實的臨床能力評估。

DeepFAN, a transformer-based deep learning model for human-artificial intelligence collaborative assessment of incidental pulmonary nodules in CT scans: a multi-reader, multi-case trial

2603.25607v1 by Zhenchen Zhu, Ge Hu, Weixiong Tan, Kai Gao, Chao Sun, Zhen Zhou, Kepei Xu, Wei Han, Meixia Shang, Xiaoming Qiu, Yiqing Tan, Jinhua Wang, Zhoumeng Ying, Li Peng, Wei Song, Lan Song, Zhengyu Jin, Nan Hong, Yizhou Yu

The widespread adoption of CT has notably increased the number of detected lung nodules. However, current deep learning methods for classifying benign and malignant nodules often fail to comprehensively integrate global and local features, and most of them have not been validated through clinical trials. To address this, we developed DeepFAN, a transformer-based model trained on over 10K pathology-confirmed nodules and further conducted a multi-reader, multi-case clinical trial to evaluate its efficacy in assisting junior radiologists. DeepFAN achieved diagnostic area under the curve (AUC) of 0.939 (95% CI 0.930-0.948) on an internal test set and 0.954 (95% CI 0.934-0.973) on the clinical trial dataset involving 400 cases across three independent medical institutions. Explainability analysis indicated higher contributions from global than local features. Twelve readers' average performance significantly improved by 10.9% (95% CI 8.3%-13.5%) in AUC, 10.0% (95% CI 8.9%-11.1%) in accuracy, 7.6% (95% CI 6.1%-9.2%) in sensitivity, and 12.6% (95% CI 10.9%-14.3%) in specificity (P<0.001 for all). Nodule-level inter-reader diagnostic consistency improved from fair to moderate (overall k: 0.313 vs. 0.421; P=0.019). In conclusion, DeepFAN effectively assisted junior radiologists and may help homogenize diagnostic quality and reduce unnecessary follow-up of indeterminate pulmonary nodules. Chinese Clinical Trial Registry: ChiCTR2400084624.

摘要:CT的廣泛應用顯著增加了檢測到的肺結節數量。然而,當前用於分類良性和惡性結節的深度學習方法往往未能全面整合全球和局部特徵,且大多數尚未通過臨床試驗進行驗證。為了解決這個問題,我們開發了DeepFAN,一種基於Transformer的模型,該模型在超過10K病理確認的結節上進行了訓練,並進一步進行了多讀者、多案例的臨床試驗,以評估其在輔助初級放射科醫生方面的有效性。DeepFAN在內部測試集上達到了0.939的診斷曲線下面積(AUC)(95% CI 0.930-0.948),在涉及三個獨立醫療機構的400個案例的臨床試驗數據集上達到了0.954(95% CI 0.934-0.973)。可解釋性分析顯示全球特徵的貢獻高於局部特徵。十二位讀者的平均表現顯著提高了10.9%(95% CI 8.3%-13.5%)的AUC,10.0%(95% CI 8.9%-11.1%)的準確率,7.6%(95% CI 6.1%-9.2%)的敏感性,以及12.6%(95% CI 10.9%-14.3%)的特異性(所有P<0.001)。結節級別的讀者間診斷一致性從公平改善到中等(整體k: 0.313 vs. 0.421; P=0.019)。總之,DeepFAN有效地輔助了初級放射科醫生,並可能有助於均化診斷質量,減少對不確定肺結節的不必要隨訪。中國臨床試驗登記:ChiCTR2400084624。

From Manipulation to Mistrust: Explaining Diverse Micro-Video Misinformation for Robust Debunking in the Wild

2603.25423v1 by Zhi Zeng, Yifei Yang, Jiaying Wu, Xulang Zhang, Xiangzheng Kong, Herun Wan, Zihan Ma, Minnan Luo

The rise of micro-videos has reshaped how misinformation spreads, amplifying its speed, reach, and impact on public trust. Existing benchmarks typically focus on a single deception type, overlooking the diversity of real-world cases that involve multimodal manipulation, AI-generated content, cognitive bias, and out-of-context reuse. Meanwhile, most detection models lack fine-grained attribution, limiting interpretability and practical utility. To address these gaps, we introduce WildFakeBench, a large-scale benchmark of over 10,000 real-world micro-videos covering diverse misinformation types and sources, each annotated with expert-defined attribution labels. Building on this foundation, we develop FakeAgent, a Delphi-inspired multi-agent reasoning framework that integrates multimodal understanding with external evidence for attribution-grounded analysis. FakeAgent jointly analyzes content and retrieved evidence to identify manipulation, recognize cognitive and AI-generated patterns, and detect out-of-context misinformation. Extensive experiments show that FakeAgent consistently outperforms existing MLLMs across all misinformation types, while WildFakeBench provides a realistic and challenging testbed for advancing explainable micro-video misinformation detection. Data and code are available at: https://github.com/Aiyistan/FakeAgent.

摘要:微視頻的興起重塑了錯誤資訊的傳播方式,放大了其速度、範圍和對公眾信任的影響。現有的基準通常專注於單一的欺騙類型,忽略了涉及多模態操控、AI生成內容、認知偏見和脫離上下文重用的現實案例的多樣性。與此同時,大多數檢測模型缺乏細緻的歸因,限制了可解釋性和實際效用。為了填補這些空白,我們推出了WildFakeBench,這是一個涵蓋超過10,000個現實微視頻的大型基準,涵蓋多種錯誤資訊類型和來源,每個視頻都附有專家定義的歸因標籤。在此基礎上,我們開發了FakeAgent,一個受德爾菲啟發的多代理推理框架,將多模態理解與外部證據整合以進行基於歸因的分析。FakeAgent共同分析內容和檢索的證據,以識別操控、識別認知和AI生成的模式,並檢測脫離上下文的錯誤資訊。廣泛的實驗顯示,FakeAgent在所有錯誤資訊類型上始終超越現有的MLLM,而WildFakeBench則提供了一個現實且具有挑戰性的測試平台,以促進可解釋的微視頻錯誤資訊檢測。數據和代碼可在以下鏈接獲取:https://github.com/Aiyistan/FakeAgent。

Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models

2603.25403v2 by Eyal Hadad, Mordechai Guri

On-device Vision-Language Models (VLMs) promise data privacy via local execution. However, we show that the architectural shift toward Dynamic High-Resolution preprocessing (e.g., AnyRes) introduces an inherent algorithmic side-channel. Unlike static models, dynamic preprocessing decomposes images into a variable number of patches based on their aspect ratio, creating workload-dependent inputs. We demonstrate a dual-layer attack framework against local VLMs. In Tier 1, an unprivileged attacker can exploit significant execution-time variations using standard unprivileged OS metrics to reliably fingerprint the input's geometry. In Tier 2, by profiling Last-Level Cache (LLC) contention, the attacker can resolve semantic ambiguity within identical geometries, distinguishing between visually dense (e.g., medical X-rays) and sparse (e.g., text documents) content. By evaluating state-of-the-art models such as LLaVA-NeXT and Qwen2-VL, we show that combining these signals enables reliable inference of privacy-sensitive contexts. Finally, we analyze the security engineering trade-offs of mitigating this vulnerability, reveal substantial performance overhead with constant-work padding, and propose practical design recommendations for secure Edge AI deployments.

摘要:在裝置上的視覺-語言模型(VLMs)透過本地執行承諾數據隱私。然而,我們顯示出向動態高解析度預處理(例如 AnyRes)的架構轉變引入了一個固有的算法側信道。與靜態模型不同,動態預處理根據圖像的長寬比將其分解為可變數量的補丁,創造出依賴於工作負載的輸入。我們展示了一個針對本地 VLMs 的雙層攻擊框架。在第一層,未特權的攻擊者可以利用標準未特權操作系統指標來利用顯著的執行時間變化,可靠地指紋輸入的幾何形狀。在第二層,通過分析最後級快取(LLC)的競爭,攻擊者可以解決相同幾何形狀中的語義模糊,區分視覺上密集(例如醫療 X 光)和稀疏(例如文本文件)內容。通過評估最先進的模型,如 LLaVA-NeXT 和 Qwen2-VL,我們顯示結合這些信號能夠可靠地推斷隱私敏感的上下文。最後,我們分析了減輕此漏洞的安全工程權衡,揭示了使用常數工作填充的顯著性能開銷,並提出了安全邊緣 AI 部署的實用設計建議。
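
To make the source of the side-channel concrete, the sketch below shows how a dynamic high-resolution preprocessor can end up with a different number of tiles for differently shaped inputs. It is loosely modeled on the grid-selection idea used by AnyRes-style pipelines and is not the exact code of any model named in the abstract. The tile size, the candidate grids, and the function names are assumptions made for illustration.

```python
TILE = 336  # assumed base tile size in pixels

# Assumed candidate canvas sizes (width, height), each a multiple of TILE.
CANDIDATE_GRIDS = [(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)]


def select_grid(width, height, candidates=CANDIDATE_GRIDS):
    """Pick the canvas that keeps the most image pixels, then wastes the least."""
    best, best_effective, best_wasted = None, -1, float("inf")
    for cw, ch in candidates:
        scale = min(cw / width, ch / height)
        dw, dh = int(width * scale), int(height * scale)
        effective = min(dw * dh, width * height)  # useful pixels after resizing
        wasted = cw * ch - effective              # padding on the canvas
        if effective > best_effective or (
            effective == best_effective and wasted < best_wasted
        ):
            best, best_effective, best_wasted = (cw, ch), effective, wasted
    return best


def tile_count(width, height):
    """Number of tiles sent to the vision encoder for one image."""
    cw, ch = select_grid(width, height)
    # The +1 stands for an assumed downscaled overview of the whole image.
    return (cw // TILE) * (ch // TILE) + 1


wide_tiles = tile_count(1000, 300)
square_tiles = tile_count(600, 600)
```

Because encoder work grows with the tile count, inputs that map to different grids take measurably different amounts of compute, which is the kind of workload dependence the Tier 1 attack exploits.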

4OPS: Structural Difficulty Modeling in Integer Arithmetic Puzzles

2603.25356v1 by Yunus E. Zeytuncu

Arithmetic puzzle games provide a controlled setting for studying difficulty in mathematical reasoning tasks, a core challenge in adaptive learning systems. We investigate the structural determinants of difficulty in a class of integer arithmetic puzzles inspired by number games. We formalize the problem and develop an exact dynamic-programming solver that enumerates reachable targets, extracts minimal-operation witnesses, and enables large-scale labeling. Using this solver, we construct a dataset of over 3.4 million instances and define difficulty via the minimum number of operations required to reach a target. We analyze the relationship between difficulty and solver-derived features. While baseline machine learning models based on bag- and target-level statistics can partially predict solvability, they fail to reliably distinguish easy instances. In contrast, we show that difficulty is fully determined by a small set of interpretable structural attributes derived from exact witnesses. In particular, the number of input values used in a minimal construction serves as a minimal sufficient statistic for difficulty under this labeling. These results provide a transparent, computationally grounded account of puzzle difficulty that bridges symbolic reasoning and data-driven modeling. The framework supports explainable difficulty estimation and principled task sequencing, with direct implications for adaptive arithmetic learning and intelligent practice systems.

摘要:算術謎題遊戲提供了一個受控的環境,用於研究數學推理任務中的難度,這是自適應學習系統中的核心挑戰。我們研究了一類受數字遊戲啟發的整數算術謎題中的難度結構決定因素。我們將問題形式化,並開發了一個精確的動態規劃求解器,該求解器枚舉可達目標,提取最小操作證據,並實現大規模標記。利用這個求解器,我們構建了一個超過340萬個實例的數據集,並通過達到目標所需的最小操作數來定義難度。我們分析了難度與求解器衍生特徵之間的關係。雖然基於袋裝和目標級統計的基線機器學習模型可以部分預測可解性,但它們無法可靠地區分簡單實例。相反,我們顯示難度完全由一小組可解釋的結構屬性決定,這些屬性來自精確的證據。特別是,在這種標記下,最小構造中使用的輸入值的數量作為難度的最小充分統計量。這些結果提供了一個透明的、基於計算的謎題難度解釋,橋接了符號推理和數據驅動建模。該框架支持可解釋的難度估計和有原則的任務排序,對自適應算術學習和智能練習系統具有直接的影響。
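
The exact solver described above can be sketched as a dynamic program over subsets of the input bag. The version below is a minimal illustration, not the paper's solver. It assumes each input may be used at most once, that the four operations are addition, subtraction, multiplication, and exact integer division, and that negative intermediate values are allowed. The function name `min_ops_table` is hypothetical, and witness extraction is omitted.

```python
def min_ops_table(bag):
    """Map every reachable value to the minimum number of operations needed."""
    n = len(bag)
    # reach[mask] holds the values that use exactly the inputs selected by mask.
    reach = {1 << i: {bag[i]} for i in range(n)}
    for mask in range(1, 1 << n):
        if mask in reach:  # singletons are already filled in
            continue
        values = set()
        sub = (mask - 1) & mask
        while sub:
            other = mask ^ sub
            if sub < other:  # visit each unordered split once
                for a in reach[sub]:
                    for b in reach[other]:
                        values.update((a + b, a * b, a - b, b - a))
                        if b != 0 and a % b == 0:
                            values.add(a // b)
                        if a != 0 and b % a == 0:
                            values.add(b // a)
            sub = (sub - 1) & mask
        reach[mask] = values

    best = {}
    for mask, values in reach.items():
        ops = bin(mask).count("1") - 1  # k inputs always cost k - 1 operations
        for v in values:
            if v not in best or ops < best[v]:
                best[v] = ops
    return best


table = min_ops_table([2, 3, 4])
```

In this sketch a construction that uses k inputs always needs k - 1 operations, so the minimum operation count is fixed by the smallest number of inputs that reaches the target. That is consistent with the abstract's remark that the number of inputs used in a minimal construction is a sufficient statistic for difficulty.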

Evaluating Language Models for Harmful Manipulation

2603.25326v2 by Canfer Akbulut, Rasmi Elasmar, Abhishek Roy, Anthony Payne, Priyanka Suresh, Lujain Ibrahim, Seliem El-Sayed, Charvi Rastogi, Ashyana Kachra, Will Hawkins, Kristian Lum, Laura Weidinger

Interest in the concept of AI-driven harmful manipulation is growing, yet current approaches to evaluating it are limited. This paper introduces a framework for evaluating harmful AI manipulation via context-specific human-AI interaction studies. We illustrate the utility of this framework by assessing an AI model with 10,101 participants spanning interactions in three AI use domains (public policy, finance, and health) and three locales (US, UK, and India). Overall, we find that the tested model can produce manipulative behaviours when prompted to do so and, in experimental settings, is able to induce belief and behaviour changes in study participants. We further find that context matters: AI manipulation differs between domains, suggesting that it needs to be evaluated in the high-stakes context(s) in which an AI system is likely to be used. We also identify significant differences across our tested geographies, suggesting that AI manipulation results from one geographic region may not generalise to others. Finally, we find that the frequency of manipulative behaviours (propensity) of an AI model is not consistently predictive of the likelihood of manipulative success (efficacy), underscoring the importance of studying these dimensions separately. To facilitate adoption of our evaluation framework, we detail our testing protocols and make relevant materials publicly available. We conclude by discussing open challenges in evaluating harmful manipulation by AI models.

摘要:對於AI驅動的有害操控概念的興趣正在增長,但目前評估該概念的方法仍然有限。本文介紹了一個通過特定情境的人類與AI互動研究來評估有害AI操控的框架。我們通過評估一個擁有10,101名參與者的AI模型來說明這個框架的實用性,這些參與者的互動涵蓋了三個AI使用領域(公共政策、金融和健康)以及三個地區(美國、英國和印度)。總體而言,我們發現該模型在被提示時能夠產生操控行為,並且在實驗環境中能夠引起研究參與者的信念和行為變化。我們進一步發現情境是重要的:AI操控在不同領域之間存在差異,這表明需要在AI系統可能被使用的高風險情境中進行評估。我們還發現我們測試的地理區域之間存在顯著差異,這表明來自某一地理區域的AI操控結果可能無法推廣到其他地區。最後,我們發現AI模型的操控行為頻率(傾向)並不總是能預測操控成功的可能性(效能),這強調了分開研究這些維度的重要性。為了促進我們評估框架的採用,我們詳細說明了我們的測試協議並公開相關材料。我們最後討論了評估AI模型有害操控的開放挑戰。

DAGverse: Building Document-Grounded Semantic DAGs from Scientific Papers

2603.25293v1 by Shu Wan, Saketh Vishnubhatla, Iskander Kushbay, Tom Heffernan, Aaron Belikoff, Raha Moraffah, Huan Liu

Directed Acyclic Graphs (DAGs) are widely used to represent structured knowledge in scientific and technical domains. However, datasets for real-world DAGs remain scarce because constructing them typically requires expert interpretation of domain documents. We study Doc2SemDAG construction: recovering a preferred semantic DAG from a document together with the cited evidence and context that explain it. This problem is challenging because a document may admit multiple plausible abstractions, the intended structure is often implicit, and the supporting evidence is scattered across prose, equations, captions, and figures. To address these challenges, we leverage scientific papers containing explicit DAG figures as a natural source of supervision. In this setting, the DAG figure provides the DAG structure, while the accompanying text provides context and explanation. We introduce DAGverse, a framework for constructing document-grounded semantic DAGs from online scientific papers. Its core component, DAGverse-Pipeline, is a semi-automatic system designed to produce high-precision semantic DAG examples through figure classification, graph reconstruction, semantic grounding, and validation. As a case study, we test the framework for causal DAGs and release DAGverse-1, a dataset of 108 expert-validated semantic DAGs with graph-level, node-level, and edge-level evidence. Experiments show that DAGverse-Pipeline outperforms existing Vision-Language Models on DAG classification and annotation. DAGverse provides a foundation for document-grounded DAG benchmarks and opens new directions for studying structured reasoning grounded in real-world evidence.

摘要:有向無環圖(DAGs)被廣泛用於表示科學和技術領域中的結構化知識。然而,現實世界中的DAG數據集仍然稀缺,因為構建它們通常需要對領域文檔的專家解讀。我們研究Doc2SemDAG的構建:從文檔中恢復一個首選的語義DAG,並提供解釋它的引用證據和上下文。這個問題具有挑戰性,因為一份文檔可能允許多種合理的抽象,預期的結構通常是隱含的,而支持證據則散布在散文、方程式、標題和圖形中。為了解決這些挑戰,我們利用包含明確DAG圖形的科學論文作為自然的監督來源。在這種情況下,DAG圖形提供了DAG結構,而隨附的文本則提供了上下文和解釋。我們引入DAGverse,一個從在線科學論文中構建文檔基礎語義DAG的框架。其核心組件DAGverse-Pipeline是一個半自動系統,旨在通過圖形分類、圖形重建、語義基礎和驗證來生成高精度的語義DAG示例。作為案例研究,我們測試了該框架在因果DAG上的應用,並發布了DAGverse-1,一個包含108個專家驗證的語義DAG的數據集,並提供圖層、節點層和邊層的證據。實驗顯示,DAGverse-Pipeline在DAG分類和標註方面超越了現有的視覺-語言模型。DAGverse為文檔基礎的DAG基準提供了基礎,並為基於現實世界證據的結構推理研究開辟了新方向。

Does Explanation Correctness Matter? Linking Computational XAI Evaluation to Human Understanding

2603.25251v1 by Gregor Baer, Chao Zhang, Isel Grau, Pieter Van Gorp

Explainable AI (XAI) methods are commonly evaluated with functional metrics such as correctness, which computationally estimate how accurately an explanation reflects the model's reasoning. Higher correctness is assumed to produce better human understanding, but this link has not been tested experimentally with controlled levels. We conducted a user study (N=200) that manipulated explanation correctness at four levels (100%, 85%, 70%, 55%) in a time series classification task where participants could not rely on domain knowledge or visual intuition and instead predicted the AI's decisions based on explanations (forward simulation). Correctness affected understanding, but not at every level: performance dropped at 70% and 55% correctness relative to fully correct explanations, while further degradation below 70% produced no additional loss. Rather than shifting performance uniformly, lower correctness decreased the proportion of participants who learned the decision pattern. At the same time, even fully correct explanations did not guarantee understanding, as only a subset of participants achieved high accuracy. Exploratory analyses showed that self-reported ratings correlated with demonstrated performance only when explanations were fully correct and participants had learned the pattern. These findings show that not all differences in functional correctness translate to differences in human understanding, underscoring the need to validate functional metrics against human outcomes.

摘要:可解釋的人工智慧(XAI)方法通常使用功能性指標進行評估,例如正確性,這些指標計算解釋反映模型推理的準確程度。假設更高的正確性會產生更好的人類理解,但這一聯繫尚未在控制水平下進行實驗測試。我們進行了一項用戶研究(N=200),在一個時間序列分類任務中操縱了解釋的正確性,分為四個水平(100%、85%、70%、55%),參與者無法依賴領域知識或視覺直覺,而是根據解釋預測人工智慧的決策(前向模擬)。正確性影響了理解,但並非在每個水平上:在70%和55%正確性下,表現相對於完全正確的解釋有所下降,而在70%以下的進一步降級則未產生額外的損失。較低的正確性並未均勻地轉變表現,而是減少了學習決策模式的參與者比例。同時,即使是完全正確的解釋也無法保證理解,因為只有一部分參與者達到了高準確性。探索性分析顯示,自我報告的評分僅在解釋完全正確且參與者已學習模式時,與實際表現相關聯。這些發現表明,並非所有功能正確性的差異都轉化為人類理解的差異,強調了需要將功能性指標與人類結果進行驗證的必要性。

Factors Influencing the Quality of AI-Generated Code: A Synthesis of Empirical Evidence

2603.25146v1 by Vehid Geruslu, Zulfiyya Aliyeva, Eray Tüzün

Context: The rapid adoption of AI-assisted code generation tools, such as large language models (LLMs), is transforming software development practices. While these tools promise significant productivity gains, concerns regarding the quality, reliability, and security of AI-generated code are increasingly reported in both academia and industry. Objective: This study aims to systematically synthesize existing empirical evidence on the factors influencing the quality of AI-generated source code and to analyze how these factors impact software quality outcomes across different evaluation contexts. Method: We conducted a systematic literature review (SLR) following established guidelines, supported by an AI-assisted workflow with human oversight. A total of 24 primary studies were selected through a structured search and screening process across major digital libraries. Data were extracted and analyzed using qualitative, pattern-based evidence synthesis. Results: The findings reveal that code quality in AI-assisted development is influenced by a combination of human factors, AI system characteristics, and human-AI interaction dynamics. Key influencing factors include prompt design, task specification, and developer expertise. The results also show variability in quality outcomes such as correctness, security, maintainability, and complexity across studies, with both improvements and risks reported. Conclusion: AI-assisted code generation represents a socio-technical shift in software engineering, where achieving high-quality outcomes depends on both technological and human factors. While promising, AI-generated code requires careful validation and integration into development workflows.

摘要:Context: AI輔助的程式碼生成工具(如大型語言模型(LLMs))的快速採用正在改變軟體開發實踐。雖然這些工具承諾顯著的生產力提升,但關於AI生成程式碼的質量、可靠性和安全性的擔憂在學術界和業界中越來越多地被報導。--Objective: 本研究旨在系統性地綜合現有的實證證據,探討影響AI生成源碼質量的因素,並分析這些因素如何影響不同評估背景下的軟體質量結果。--Method: 我們按照既定指導方針進行了系統性文獻回顧(SLR),並在人工監督下支持AI輔助的工作流程。通過結構化的搜索和篩選過程,從主要數字圖書館中選擇了共24項主要研究。數據通過定性、基於模式的證據綜合進行提取和分析。--Results: 研究結果顯示,AI輔助開發中的程式碼質量受到人為因素、AI系統特徵和人機互動動態的綜合影響。主要影響因素包括提示設計、任務規範和開發者專業知識。結果還顯示,不同研究中的質量結果(如正確性、安全性、可維護性和複雜性)存在變異,報告了改進和風險。--Conclusion: AI輔助程式碼生成代表了軟體工程中的社會技術轉變,實現高質量結果依賴於技術和人為因素。雖然前景看好,但AI生成的程式碼需要仔細驗證並整合進開發工作流程中。

An Explainable Ensemble Learning Framework for Crop Classification with Optimized Feature Pyramids and Deep Networks

2603.25070v1 by Syed Rayhan Masud, SK Muktadir Hossain, Md. Ridoy Sarkar, Mohammad Sakib Mahmood, Md. Kishor Morol, Rakib Hossain Sajib

Agriculture is increasingly challenged by climate change, soil degradation, and resource depletion, and hence requires advanced data-driven crop classification and recommendation solutions. This work presents an explainable ensemble learning paradigm that fuses optimized feature pyramids, deep networks, self-attention mechanisms, and residual networks to improve crop suitability predictions based on soil characteristics (e.g., pH, nitrogen, potassium) and climatic conditions (e.g., temperature, rainfall). With a dataset comprising 3,867 instances and 29 features from the Ethiopian Agricultural Transformation Agency and NASA, the paradigm leverages preprocessing methods such as label encoding, outlier removal using IQR, normalization through StandardScaler, and SMOTE for balancing classes. A range of machine learning models such as Logistic Regression, K-Nearest Neighbors, Support Vector Machines, Decision Trees, Random Forest, Gradient Boosting, and a new Relative Error Support Vector Machine are compared, with hyperparameter tuning through Grid Search and cross-validation. The proposed "Final Ensemble" meta-ensemble design achieves 98.80% accuracy, precision, recall, and F1-score, outperforming individual models such as K-Nearest Neighbors (95.56% accuracy). Explainable AI methods, such as SHAP and permutation importance, offer actionable insights, highlighting critical features such as soil pH, nitrogen, and zinc. The paradigm addresses the gap between intricate ML models and actionable agricultural decision-making, fostering sustainability and trust in AI-powered recommendations.

摘要:農業正面臨氣候變遷、土壤退化和資源枯竭等挑戰,因此需要先進的數據驅動作物分類和推薦解決方案。這項工作提出了一種可解釋的集成學習範式,融合了優化的特徵金字塔、深度網絡、自注意力機制和殘差網絡,以增強基於土壤特徵(例如 pH、氮、鉀)和氣候條件(例如溫度、降雨)的作物適宜性預測。該範式利用來自埃塞俄比亞農業轉型機構和NASA的數據集,包含3,867個實例和29個特徵,並採用標籤編碼、使用IQR進行異常值移除、通過StandardScaler進行正規化以及SMOTE進行類別平衡等預處理方法。比較了一系列機器學習模型,如邏輯回歸、K最近鄰、支持向量機、決策樹、隨機森林、梯度提升以及一種新的相對誤差支持向量機,並通過網格搜索和交叉驗證進行超參數調整。建議的“最終集成”元集成設計以98.80%的準確率、精確度、召回率和F1分數超越了單獨模型,如K最近鄰(準確率95.56%)。可解釋的AI方法,如SHAP和置換重要性,提供了可行的見解,突顯了土壤pH、氮和鋅等關鍵特徵。該範式填補了複雜機器學習模型與可行農業決策之間的差距,促進了可持續性和對AI驅動推薦的信任。

Efficient Detection of Bad Benchmark Items with Novel Scalability Coefficients

2603.24999v2 by Michael Hardy, Joshua Gilbert, Benjamin Domingue

The validity of assessments, from large-scale AI benchmarks to human classrooms, depends on the quality of individual items, yet modern evaluation instruments often contain thousands of items with minimal psychometric vetting. We introduce a new family of nonparametric scalability coefficients based on interitem isotonic regression for efficiently detecting globally bad items (e.g., miskeyed, ambiguously worded, or construct-misaligned). The central contribution is the signed isotonic $R^2$, which measures the maximal proportion of variance in one item explainable by a monotone function of another while preserving the direction of association via Kendall's $τ$. Aggregating these pairwise coefficients yields item-level scores that sharply separate problematic items from acceptable ones without assuming linearity or committing to a parametric item response model. We show that the signed isotonic $R^2$ is extremal among monotone predictors (it extracts the strongest possible monotone signal between any two items) and show that this optimality property translates directly into practical screening power. Across three AI benchmark datasets (HS Math, GSM8K, MMLU) and two human assessment datasets, the signed isotonic $R^2$ consistently achieves top-tier AUC for ranking bad items above good ones, outperforming or matching a comprehensive battery of classical test theory, item response theory, and dimensionality-based diagnostics. Crucially, the method remains robust under the small-n/large-p conditions typical of AI evaluation, requires only bivariate monotone fits computable in seconds, and handles mixed item types (binary, ordinal, continuous) without modification. It is a lightweight, model-agnostic filter that can materially reduce the reviewer effort needed to find flawed items in modern large-scale evaluation regimes.

摘要:評估的有效性,從大型 AI 基準到人類教室,取決於個別項目的質量,但現代評估工具通常包含數千個項目,且心理計量學的審核極少。我們引入了一種基於項目間等單調回歸的新型非參數可擴展性係數,用於有效檢測全球不良項目(例如,錯誤標記、措辭模糊或構念不對齊)。核心貢獻是有符號的等單調 $R^2$,它測量一個項目中可由另一個項目的單調函數解釋的最大變異比例,同時通過 Kendall 的 $τ$ 保持關聯方向。聚合這些成對係數產生的項目級分數能夠明確區分問題項目與可接受項目,而無需假設線性或承諾於參數項目反應模型。我們展示了有符號的等單調 $R^2$ 在單調預測變數中是極端的(它提取了任何兩個項目之間最強的單調信號),並且這種最佳性特性直接轉化為實際篩選能力。在三個 AI 基準數據集(HS Math、GSM8K、MMLU)和兩個人類評估數據集中,有符號的等單調 $R^2$ 一貫在將不良項目排名高於良好項目方面達到頂級 AUC,超越或匹配了全面的經典測試理論、項目反應理論和基於維度的診斷工具。關鍵是,該方法在 AI 評估中典型的小樣本/大特徵條件下仍然穩健,只需幾秒鐘即可計算的雙變量單調擬合,並且能夠處理混合項目類型(二元、序數、連續)而無需修改。這是一個輕量級、模型無關的過濾器,能夠實質性減少評審者在現代大型評估體系中尋找缺陷項目所需的努力。
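
A pairwise coefficient in the spirit of the signed isotonic $R^2$ can be sketched in plain Python: fit a monotone function of one item to another with the pool-adjacent-violators algorithm, compute the share of variance explained, and attach the sign of Kendall's $τ$. This is a minimal sketch under assumptions and not the authors' implementation. The handling of ties, the choice to treat a zero $τ$ as a positive direction, and the function names are all hypothetical.

```python
def pava(values, weights):
    """Weighted pool-adjacent-violators: best non-decreasing fit to values."""
    blocks = []  # each block is [mean, weight, number_of_points]
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


def signed_isotonic_r2(x, y):
    """Variance in y explained by a monotone function of x, signed by tau."""
    n = len(x)
    # Sign of Kendall's tau: concordant minus discordant pairs.
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    direction = -1 if s < 0 else 1  # assumption: zero tau counts as positive

    mean_y = sum(y) / n
    sst = sum((v - mean_y) ** 2 for v in y)
    if sst == 0:
        return 0.0  # a constant item carries no variance to explain

    # Pool tied x values, flipping x when the association is negative.
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(direction * xi, []).append(yi)
    keys = sorted(groups)
    means = [sum(groups[k]) / len(groups[k]) for k in keys]
    weights = [len(groups[k]) for k in keys]
    fitted = dict(zip(keys, pava(means, weights)))

    sse = sum((yi - fitted[direction * xi]) ** 2 for xi, yi in zip(x, y))
    return direction * (1 - sse / sst)


increasing = signed_isotonic_r2([1, 2, 3, 4], [1, 2, 3, 4])
decreasing = signed_isotonic_r2([1, 2, 3, 4], [4, 3, 2, 1])
noisy = signed_isotonic_r2([1, 2, 3, 4, 5], [1, 3, 2, 5, 4])
binary = signed_isotonic_r2([0, 0, 1, 1], [0, 1, 1, 1])
```

Averaging such coefficients over all partner items would give one item-level score. How the paper aggregates them is not specified in the abstract, so that step is left out here.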

Rethinking Health Agents: From Siloed AI to Collaborative Decision Mediators

2603.24986v1 by Ray-Yuan Chung, Xuhai Xu, Ari Pollack

Large language model based health agents are increasingly used by health consumers and clinicians to interpret health information and guide health decisions. However, most AI systems in healthcare operate in siloed configurations, supporting individual users rather than the multi-stakeholder relationships central to healthcare. Such use can fragment understanding and exacerbate misalignment among patients, caregivers, and clinicians. We reframe AI not as a standalone assistant, but as a collaborator embedded within multi-party care interactions. Through a clinically validated fictional pediatric chronic kidney disease case study, we show that breakdowns in adherence stem from fragmented situational awareness and misaligned goals, and that siloed use of general-purpose AI tools does little to address these collaboration gaps. We propose a conceptual framework for designing AI collaborators that surface contextual information, reconcile mental models, and scaffold shared understanding while preserving human decision authority.

摘要:大型語言模型基礎的健康代理人越來越多地被健康消費者和臨床醫生用來解釋健康信息並指導健康決策。然而,大多數醫療保健中的人工智慧系統運作在孤立的配置中,支持個別用戶,而不是醫療保健中多方利益相關者關係的核心。這種使用方式可能會導致理解的碎片化,並加劇患者、照護者和臨床醫生之間的目標不一致。我們將人工智慧重新定義為一個嵌入多方護理互動中的合作者,而不是一個獨立的助手。通過一個臨床驗證的虛構小兒慢性腎病案例研究,我們顯示出遵循的中斷源於情境意識的碎片化和目標的不一致,而通用人工智慧工具的孤立使用對於解決這些協作差距幾乎沒有幫助。我們提出了一個設計人工智慧合作者的概念框架,旨在呈現上下文信息、調和心理模型,並支撐共享理解,同時保留人類的決策權威。

Self-Corrected Image Generation with Explainable Latent Rewards

2603.24965v1 by Yinyi Luo, Hrishikesh Gokhale, Marios Savvides, Jindong Wang, Shengfeng He

Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. This difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tractable. Motivated by this asymmetry, we propose xLARD, a self-correcting framework that uses multimodal large language models to guide generation through Explainable LAtent RewarDs. xLARD introduces a lightweight corrector that refines latent representations based on structured feedback from model-generated references. A key component is a differentiable mapping from latent edits to interpretable reward signals, enabling continuous latent-level guidance from non-differentiable image-level evaluations. This mechanism allows the model to understand, assess, and correct itself during generation. Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors. Code is available at https://yinyiluo.github.io/xLARD/.

摘要:儘管在文本到圖像生成方面取得了重大進展,但將輸出與複雜提示對齊仍然具有挑戰性,特別是在細緻的語義和空間關係方面。這一困難源於生成的前饋性質,這要求在未完全理解輸出的情況下預測對齊。相比之下,評估生成的圖像則更為可行。受到這種不對稱性的啟發,我們提出了xLARD,一個自我校正的框架,利用多模態大型語言模型通過可解釋的潛在獎勵來指導生成。xLARD引入了一個輕量級的校正器,根據模型生成的參考的結構化反饋來細化潛在表示。一個關鍵組件是從潛在編輯到可解釋獎勵信號的可微映射,使得從不可微的圖像級評估中實現持續的潛在級指導。這一機制使模型在生成過程中能夠理解、評估和自我校正。跨多樣的生成和編輯任務的實驗顯示,xLARD改善了語義對齊和視覺真實性,同時保持生成先驗。代碼可在https://yinyiluo.github.io/xLARD/獲得。

Can MLLMs Read Students' Minds? Unpacking Multimodal Error Analysis in Handwritten Math

2603.24961v1 by Dingjie Song, Tianlong Xu, Yi-Fan Zhang, Hang Li, Zhiling Yan, Xing Fan, Haoyang Li, Lichao Sun, Qingsong Wen

Assessing student handwritten scratchwork is crucial for personalized educational feedback but presents unique challenges due to diverse handwriting, complex layouts, and varied problem-solving approaches. Existing educational NLP primarily focuses on textual responses and neglects the complexity and multimodality inherent in authentic handwritten scratchwork. Current multimodal large language models (MLLMs) excel at visual reasoning but typically adopt an "examinee perspective", prioritizing generating correct answers rather than diagnosing student errors. To bridge these gaps, we introduce ScratchMath, a novel benchmark specifically designed for explaining and classifying errors in authentic handwritten mathematics scratchwork. Our dataset comprises 1,720 mathematics samples from Chinese primary and middle school students, supporting two key tasks: Error Cause Explanation (ECE) and Error Cause Classification (ECC), with seven defined error types. The dataset is meticulously annotated through rigorous human-machine collaborative approaches involving multiple stages of expert labeling, review, and verification. We systematically evaluate 16 leading MLLMs on ScratchMath, revealing significant performance gaps relative to human experts, especially in visual recognition and logical reasoning. Proprietary models notably outperform open-source models, with large reasoning models showing strong potential for error explanation. All evaluation data and frameworks are publicly available to facilitate further research.

摘要:評估學生的手寫草稿對於個性化的教育反饋至關重要,但由於手寫風格多樣、佈局複雜以及解題方法各異,這也帶來了獨特的挑戰。現有的教育自然語言處理(NLP)主要集中於文本回應,忽視了真實手寫草稿中固有的複雜性和多模態性。目前的多模態大型語言模型(MLLMs)在視覺推理方面表現出色,但通常採取“考生視角”,優先生成正確答案,而不是診斷學生的錯誤。為了彌補這些空白,我們推出了ScratchMath,一個專門設計用於解釋和分類真實手寫數學草稿錯誤的新基準。我們的數據集包含來自中國小學和中學學生的1,720個數學樣本,支持兩個關鍵任務:錯誤原因解釋(ECE)和錯誤原因分類(ECC),並定義了七種錯誤類型。該數據集通過嚴格的人機協作方法進行了精心註釋,涉及多個專家標記、審查和驗證階段。我們系統地評估了16個領先的MLLMs在ScratchMath上的表現,顯示出相對於人類專家的顯著性能差距,特別是在視覺識別和邏輯推理方面。專有模型顯著優於開源模型,大型推理模型在錯誤解釋方面顯示出強大的潛力。所有評估數據和框架均公開可用,以促進進一步的研究。

Explaining, Verifying, and Aligning Semantic Hierarchies in Vision-Language Model Embeddings

2603.26798v1 by Gesina Schwalbe, Mert Keser, Moritz Bayerkuhnlein, Edgar Heinert, Annika Mütze, Marvin Keller, Sparsh Tiwari, Georgii Mikriukov, Diedrich Wolter, Jae Hee Lee, Matthias Rottmann

Vision-language model (VLM) encoders such as CLIP enable strong retrieval and zero-shot classification in a shared image-text embedding space, yet the semantic organization of this space is rarely inspected. We present a post-hoc framework to explain, verify, and align the semantic hierarchies induced by a VLM over a given set of child classes. First, we extract a binary hierarchy by agglomerative clustering of class centroids and name internal nodes by dictionary-based matching to a concept bank. Second, we quantify plausibility by comparing the extracted tree against human ontologies using efficient tree- and edge-level consistency measures, and we evaluate utility via explainable hierarchical tree-traversal inference with uncertainty-aware early stopping (UAES). Third, we propose an ontology-guided post-hoc alignment method that learns a lightweight embedding-space transformation, using UMAP to generate target neighborhoods from a desired hierarchy. Across 13 pretrained VLMs and 4 image datasets, our method finds systematic modality differences: image encoders are more discriminative, while text encoders induce hierarchies that better match human taxonomies. Overall, the results reveal a persistent trade-off between zero-shot accuracy and ontological plausibility and suggest practical routes to improve semantic alignment in shared embedding spaces.

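Summary: Vision-language model (VLM) encoders such as CLIP enable strong retrieval and zero-shot classification in a shared image-text embedding space, yet the semantic organization of this space is rarely examined. We propose a post-hoc framework to explain, verify, and align the semantic hierarchies that a VLM induces over a given set of child classes. First, we extract a binary hierarchy by agglomerative clustering of class centroids and name the internal nodes through dictionary-based matching against a concept bank. Second, we quantify plausibility by comparing the extracted tree with human ontologies using efficient tree-level and edge-level consistency measures, and we assess utility through explainable hierarchical tree-traversal inference with uncertainty-aware early stopping (UAES). Third, we propose an ontology-guided post-hoc alignment method that learns a lightweight embedding-space transformation, using UMAP to generate target neighborhoods from a desired hierarchy. Across 13 pretrained VLMs and 4 image datasets, our method reveals systematic modality differences: image encoders are more discriminative, while text encoders induce hierarchies that better match human taxonomies. Overall, the results reveal a persistent trade-off between zero-shot accuracy and ontological plausibility and suggest practical routes to improving semantic alignment in shared embedding spaces.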
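
The first step of the framework above (a binary hierarchy obtained by agglomerative clustering of class centroids) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy 3-dimensional centroids, the cosine distance, and the average-linkage rule are not taken from the paper, which does not state its distance or linkage choice in the abstract.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def build_binary_hierarchy(centroids):
    """Greedy average-linkage agglomeration over class centroids.

    centroids: dict mapping class name -> embedding vector (hypothetical data).
    Returns a nested tuple; every internal node has exactly two children.
    """
    # Each cluster is (subtree, list of member class names).
    clusters = [(name, [name]) for name in centroids]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            members_i, members_j = clusters[i][1], clusters[j][1]
            # Average pairwise distance between the two clusters.
            dist = sum(
                cosine_distance(centroids[a], centroids[b])
                for a in members_i
                for b in members_j
            ) / (len(members_i) * len(members_j))
            if best is None or dist < best[0]:
                best = (dist, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Toy class centroids (illustrative values, not real CLIP embeddings).
centroids = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.0],
    "car": [0.0, 0.1, 0.9],
    "truck": [0.1, 0.0, 0.9],
}
tree = build_binary_hierarchy(centroids)
print(tree)
```

On this toy input the two animal classes and the two vehicle classes are merged first, so the root splits animals from vehicles. Naming the internal nodes against a concept bank, as the abstract describes, would be a separate step.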

Sovereign AI at the Front Door of Care: A Physically Unidirectional Architecture for Secure Clinical Intelligence

2603.24898v1 by Vasu Srinivasan, Dhriti Vasu

We present a Sovereign AI architecture for clinical triage in which all inference is performed on-device and inbound data is delivered via a physically unidirectional channel, implemented using receive-only broadcast infrastructure or certified hardware data diodes, with no return path to any external network. This design removes the network-mediated attack surface by construction, rather than attempting to secure it through software controls. The system performs conversational symptom intake, integrates device-captured vitals, and produces structured, triage-aligned clinical records at the point of care. We formalize the security properties of receiver-side unidirectionality and show that the architecture is transport-agnostic across broadcast and diode-enforced deployments. We further analyze threat models, enforcement mechanisms, and deployment configurations, demonstrating how physical one-way data flow enables high-assurance operation in both resource-constrained and high-risk environments. This work positions physically unidirectional channels as a foundational primitive for sovereign, on-device clinical intelligence at the front door of care.

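Summary: We propose a Sovereign AI architecture for clinical triage in which all inference runs on-device and inbound data arrives over a physically unidirectional channel, implemented with receive-only broadcast infrastructure or certified hardware data diodes, with no return path to any external network. This design removes the network-mediated attack surface by construction rather than trying to secure it through software controls. The system performs conversational symptom intake, integrates device-captured vitals, and produces structured, triage-aligned clinical records at the point of care. We formalize the security properties of receiver-side unidirectionality and show that the architecture is transport-agnostic across broadcast and diode-enforced deployments. We further analyze threat models, enforcement mechanisms, and deployment configurations, showing how physical one-way data flow enables high-assurance operation in resource-constrained and high-risk environments. This work positions physically unidirectional channels as a foundational primitive for sovereign, on-device clinical intelligence at the front door of care.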

More Than "Means to an End": Supporting Reasoning with Transparently Designed AI Data Science Processes

2603.24877v1 by Venkatesh Sivaraman, Patrick Vossler, Adam Perer, Julian Hong, Jean Feng

Generative artificial intelligence (AI) tools can now help people perform complex data science tasks regardless of their expertise. While these tools have great potential to help more people work with data, their end-to-end approach does not support users in evaluating alternative approaches and reformulating problems, both critical to solving open-ended tasks in high-stakes domains. In this paper, we reflect on two AI data science systems designed for the medical setting and how they function as tools for thought. We find that success in these systems was driven by constructing AI workflows around intentionally-designed intermediate artifacts, such as readable query languages, concept definitions, or input-output examples. Despite opaqueness in other parts of the AI process, these intermediates helped users reason about important analytical choices, refine their initial questions, and contribute their unique knowledge. We invite the HCI community to consider when and how intermediate artifacts should be designed to promote effective data science thinking.

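Summary: Generative artificial intelligence (AI) tools can now help people perform complex data science tasks regardless of their expertise. Although these tools have great potential to help more people work with data, their end-to-end approach does not support users in evaluating alternative approaches and reformulating problems, both of which are critical to solving open-ended tasks in high-stakes domains. In this paper, we reflect on two AI data science systems designed for the medical setting and on how they function as tools for thought. We find that success in these systems was driven by building AI workflows around intentionally designed intermediate artifacts, such as readable query languages, concept definitions, or input-output examples. Despite opacity in other parts of the AI process, these intermediates helped users reason about important analytical choices, refine their initial questions, and contribute their unique knowledge. We invite the HCI community to consider when and how intermediate artifacts should be designed to promote effective data science thinking.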

A Practical Guide Towards Interpreting Time-Series Deep Clinical Predictive Models: A Reproducibility Study

2603.24828v1 by Yongda Fan, John Wu, Andrea Fitzpatrick, Naveen Baskaran, Jimeng Sun, Adam Cross

Clinical decisions are high-stakes and require explicit justification, making model interpretability essential for auditing deep clinical models prior to deployment. As the ecosystem of model architectures and explainability methods expands, critical questions remain: Do architectural features like attention improve explainability? Do interpretability approaches generalize across clinical tasks? While prior benchmarking efforts exist, they often lack extensibility and reproducibility, and critically, fail to systematically examine how interpretability varies across the interplay of clinical tasks and model architectures. To address these gaps, we present a comprehensive benchmark evaluating interpretability methods across diverse clinical prediction tasks and model architectures. Our analysis reveals that: (1) attention when leveraged properly is a highly efficient approach for faithfully interpreting model predictions; (2) black-box interpreters like KernelSHAP and LIME are computationally infeasible for time-series clinical prediction tasks; and (3) several interpretability approaches are too unreliable to be trustworthy. From our findings, we discuss several guidelines on improving interpretability within clinical predictive pipelines. To support reproducibility and extensibility, we provide our implementations via PyHealth, a well-documented open-source framework: https://github.com/sunlabuiuc/PyHealth.

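Summary: Clinical decisions are high-stakes and require explicit justification, which makes model interpretability essential for auditing deep clinical models before deployment. As the ecosystem of model architectures and explainability methods expands, critical questions remain: Do architectural features such as attention improve explainability? Do interpretability approaches generalize across clinical tasks? Prior benchmarking efforts exist, but they often lack extensibility and reproducibility and, critically, fail to examine systematically how interpretability varies across the interplay of clinical tasks and model architectures. To address these gaps, we present a comprehensive benchmark that evaluates interpretability methods across diverse clinical prediction tasks and model architectures. Our analysis reveals that: (1) attention, when leveraged properly, is a highly efficient approach for faithfully interpreting model predictions; (2) black-box interpreters such as KernelSHAP and LIME are computationally infeasible for time-series clinical prediction tasks; and (3) several interpretability approaches are too unreliable to be trusted. Based on our findings, we discuss several guidelines for improving interpretability within clinical predictive pipelines. To support reproducibility and extensibility, we provide our implementations via PyHealth, a well-documented open-source framework: https://github.com/sunlabuiuc/PyHealth.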

PhyDCM: A Reproducible Open-Source Framework for AI-Assisted Brain Tumor Classification from Multi-Sequence MRI

2603.26794v1 by Hayder Saad Abdulbaqi, Mohammed Hadi Rahim, Mohammed Hassan Hadi, Haider Ali Aboud, Ali Hussein Allawi

MRI-based medical imaging has become indispensable in modern clinical diagnosis, particularly for brain tumor detection. However, the rapid growth in data volume poses challenges for conventional diagnostic approaches. Although deep learning has shown strong performance in automated classification, many existing solutions are confined to closed technical architectures, limiting reproducibility and further academic development. PhyDCM is introduced as an open-source software framework that integrates a hybrid classification architecture based on MedViT with standardized DICOM processing and an interactive desktop visualization interface. The system is designed as a modular digital library that separates computational logic from the graphical interface, allowing independent modification and extension of components. Standardized preprocessing, including intensity rescaling and limited data augmentation, ensures consistency across varying MRI acquisition settings. Experimental evaluation on MRI datasets from BRISC2025 and curated Kaggle collections (FigShare, SARTAJ, and Br35H) demonstrates stable diagnostic performance, achieving over 93% classification accuracy across categories. The framework supports structured, exportable outputs and multi-planar reconstruction of volumetric data. By emphasizing transparency, modularity, and accessibility, PhyDCM provides a practical foundation for reproducible AI-driven medical image analysis, with flexibility for future integration of additional imaging modalities.

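Summary: MRI-based medical imaging has become indispensable in modern clinical diagnosis, particularly for brain tumor detection. However, the rapid growth in data volume poses challenges for conventional diagnostic approaches. Although deep learning has shown strong performance in automated classification, many existing solutions are confined to closed technical architectures, which limits reproducibility and further academic development. PhyDCM is introduced as an open-source software framework that integrates a hybrid classification architecture based on MedViT with standardized DICOM processing and an interactive desktop visualization interface. The system is designed as a modular digital library that separates computational logic from the graphical interface, allowing components to be modified and extended independently. Standardized preprocessing, including intensity rescaling and limited data augmentation, ensures consistency across varying MRI acquisition settings. Experimental evaluation on MRI datasets from BRISC2025 and curated Kaggle collections (FigShare, SARTAJ, and Br35H) shows stable diagnostic performance, with over 93% classification accuracy across categories. The framework supports structured, exportable outputs and multi-planar reconstruction of volumetric data. By emphasizing transparency, modularity, and accessibility, PhyDCM provides a practical foundation for reproducible AI-driven medical image analysis, with flexibility for future integration of additional imaging modalities.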

Dissecting Model Failures in Abdominal Aortic Aneurysm Segmentation through Explainability-Driven Analysis

2603.24801v1 by Abu Noman Md Sakib, Merjulah Roby, Zijie Zhang, Satish Muluk, Mark K. Eskandari, Ender A. Finol

Computed tomography image segmentation of complex abdominal aortic aneurysms (AAA) often fails because the models assign internal focus to irrelevant structures or do not focus on thin, low-contrast targets. Where the model looks is the primary training signal, and thus we propose an Explainable AI (XAI) guided encoder shaping framework. Our method computes a dense, attribution-based encoder focus map ("XAI field") from the final encoder block and uses it in two complementary ways: (i) we align the predicted probability mass to the XAI field to promote agreement between focus and output; and (ii) we route the field into a lightweight refinement pathway and a confidence prior that modulates logits at inference, suppressing distractors while preserving subtle structures. The objective terms serve only as control signals; the contribution is the integration of attribution guidance into representation and decoding. We evaluate clinically validated challenging cases curated for failure-prone scenarios. Compared to a base SAM setup, our implementation yields substantial improvements. The observed gains suggest that explicitly optimizing encoder focus via XAI guidance is a practical and effective principle for reliable segmentation in complex scenarios.

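Summary: Computed tomography image segmentation of complex abdominal aortic aneurysms (AAA) often fails because models assign internal focus to irrelevant structures or fail to focus on thin, low-contrast targets. Where the model looks is the primary training signal, so we propose an Explainable AI (XAI) guided encoder shaping framework. Our method computes a dense, attribution-based encoder focus map (the "XAI field") from the final encoder block and uses it in two complementary ways: (i) we align the predicted probability mass with the XAI field to promote agreement between focus and output; and (ii) we route the field into a lightweight refinement pathway and a confidence prior that modulates logits at inference, suppressing distractors while preserving subtle structures. The objective terms serve only as control signals; the contribution is the integration of attribution guidance into representation and decoding. We evaluate on clinically validated, challenging cases curated for failure-prone scenarios. Compared with a base SAM setup, our implementation yields substantial improvements. The observed gains suggest that explicitly optimizing encoder focus via XAI guidance is a practical and effective principle for reliable segmentation in complex scenarios.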

A Sociolinguistic Analysis of Automatic Speech Recognition Bias in Newcastle English

2603.24549v1 by Dana Serditova, Kevin Tang

Automatic Speech Recognition (ASR) systems are widely used in everyday communication, education, healthcare, and industry, yet their performance remains uneven across speakers, particularly when dialectal variation diverges from the mainstream accents represented in training data. This study investigates ASR bias through a sociolinguistic analysis of Newcastle English, a regional variety of North-East England that has been shown to challenge current speech recognition technologies. Using spontaneous speech from the Diachronic Electronic Corpus of Tyneside English (DECTE), we evaluate the output of a state-of-the-art commercial ASR system and conduct a fine-grained analysis of more than 3,000 transcription errors. Errors are classified by linguistic domain and examined in relation to social variables including gender, age, and socioeconomic status. In addition, an acoustic case study of selected vowel features demonstrates how gradient phonetic variation contributes directly to misrecognition. The results show that phonological variation accounts for the majority of errors, with recurrent failures linked to dialect-specific features like vowel quality and glottalisation, as well as local vocabulary and non-standard grammatical forms. Error rates also vary across social groups, with higher error frequencies observed for men and for speakers at the extremes of the age spectrum. These findings indicate that ASR errors are not random but socially patterned and can be explained from a sociolinguistic perspective. Thus, the study demonstrates the importance of incorporating sociolinguistic expertise into the evaluation and development of speech technologies and argues that more equitable ASR systems require explicit attention to dialectal variation and community-based speech data.

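Summary: Automatic Speech Recognition (ASR) systems are widely used in everyday communication, education, healthcare, and industry, yet their performance remains uneven across speakers, particularly when dialectal variation diverges from the mainstream accents represented in training data. This study investigates ASR bias through a sociolinguistic analysis of Newcastle English, a regional variety of North-East England that has been shown to challenge current speech recognition technologies. Using spontaneous speech from the Diachronic Electronic Corpus of Tyneside English (DECTE), we evaluate the output of a state-of-the-art commercial ASR system and conduct a fine-grained analysis of more than 3,000 transcription errors. Errors are classified by linguistic domain and examined in relation to social variables including gender, age, and socioeconomic status. In addition, an acoustic case study of selected vowel features shows how gradient phonetic variation contributes directly to misrecognition. The results show that phonological variation accounts for the majority of errors, with recurrent failures linked to dialect-specific features such as vowel quality and glottalisation, as well as local vocabulary and non-standard grammatical forms. Error rates also vary across social groups, with higher error frequencies for men and for speakers at the extremes of the age spectrum. These findings indicate that ASR errors are not random but socially patterned and can be explained from a sociolinguistic perspective. The study therefore demonstrates the importance of incorporating sociolinguistic expertise into the evaluation and development of speech technologies and argues that more equitable ASR systems require explicit attention to dialectal variation and community-based speech data.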

No Single Metric Tells the Whole Story: A Multi-Dimensional Evaluation Framework for Uncertainty Attributions

2603.24524v1 by Emily Schiller, Teodor Chiaburu, Marco Zullich, Luca Longo

Research on explainable AI (XAI) has frequently focused on explaining model predictions. More recently, methods have been proposed to explain prediction uncertainty by attributing it to input features (uncertainty attributions). However, the evaluation of these methods remains inconsistent as studies rely on heterogeneous proxy tasks and metrics, hindering comparability. We address this by aligning uncertainty attributions with the well-established Co-12 framework for XAI evaluation. We propose concrete implementations for the correctness, consistency, continuity, and compactness properties. Additionally, we introduce conveyance, a property tailored to uncertainty attributions that evaluates whether controlled increases in epistemic uncertainty reliably propagate to feature-level attributions. We demonstrate our evaluation framework with eight metrics across combinations of uncertainty quantification and feature attribution methods on tabular and image data. Our experiments show that gradient-based methods consistently outperform perturbation-based approaches in consistency and conveyance, while Monte-Carlo dropconnect outperforms Monte-Carlo dropout in most metrics. Although most metrics rank the methods consistently across samples, inter-method agreement remains low. This suggests no single metric sufficiently evaluates uncertainty attribution quality. The proposed evaluation framework contributes to the body of knowledge by establishing a foundation for systematic comparison and development of uncertainty attribution methods.

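Summary: Research on explainable AI (XAI) has frequently focused on explaining model predictions. More recently, methods have been proposed to explain prediction uncertainty by attributing it to input features (uncertainty attributions). However, the evaluation of these methods remains inconsistent because studies rely on heterogeneous proxy tasks and metrics, which hinders comparability. We address this by aligning uncertainty attributions with the well-established Co-12 framework for XAI evaluation. We propose concrete implementations for the correctness, consistency, continuity, and compactness properties. In addition, we introduce conveyance, a property tailored to uncertainty attributions that evaluates whether controlled increases in epistemic uncertainty reliably propagate to feature-level attributions. We demonstrate our evaluation framework with eight metrics across combinations of uncertainty quantification and feature attribution methods on tabular and image data. Our experiments show that gradient-based methods consistently outperform perturbation-based approaches in consistency and conveyance, while Monte-Carlo dropconnect outperforms Monte-Carlo dropout on most metrics. Although most metrics rank the methods consistently across samples, agreement between metrics remains low. This suggests that no single metric sufficiently evaluates uncertainty attribution quality. The proposed evaluation framework contributes to the body of knowledge by establishing a foundation for the systematic comparison and development of uncertainty attribution methods.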

Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA

2603.24481v1 by John Ray B. Martinez

Miscalibrated confidence scores are a practical obstacle to deploying AI in clinical settings. A model that is always overconfident offers no useful signal for deferral. We present a multi-agent framework that combines domain-specific specialist agents with Two-Phase Verification and S-Score Weighted Fusion to improve both calibration and discrimination in medical multiple-choice question answering. Four specialist agents (respiratory, cardiology, neurology, gastroenterology) generate independent diagnoses using Qwen2.5-7B-Instruct. Each diagnosis is then subjected to a two-phase self-verification process that measures internal consistency and produces a Specialist Confidence Score (S-score). The S-scores drive a weighted fusion strategy that selects the final answer and calibrates the reported confidence. We evaluate across four experimental settings, covering 100-question and 250-question high-disagreement subsets of both MedQA-USMLE and MedMCQA. Calibration improvement is the central finding, with ECE reduced by 49-74% across all four settings, including the harder MedMCQA benchmark where these gains persist even when absolute accuracy is constrained by knowledge-intensive recall demands. On MedQA-250, the full system achieves ECE = 0.091 (74.4% reduction over the single-specialist baseline) and AUROC = 0.630 (+0.056) at 59.2% accuracy. Ablation analysis identifies Two-Phase Verification as the primary calibration driver and multi-agent reasoning as the primary accuracy driver. These results establish that consistency-based verification produces more reliable uncertainty estimates across diverse medical question types, providing a practical confidence signal for deferral in safety-critical clinical AI applications.

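Summary: Miscalibrated confidence scores are a practical obstacle to deploying AI in clinical settings. A model that is always overconfident offers no useful signal for deferral. We present a multi-agent framework that combines domain-specific specialist agents with Two-Phase Verification and S-Score Weighted Fusion to improve both calibration and discrimination in medical multiple-choice question answering. Four specialist agents (respiratory, cardiology, neurology, gastroenterology) generate independent diagnoses using Qwen2.5-7B-Instruct. Each diagnosis then undergoes a two-phase self-verification process that measures internal consistency and produces a Specialist Confidence Score (S-score). The S-scores drive a weighted fusion strategy that selects the final answer and calibrates the reported confidence. We evaluate across four experimental settings, covering 100-question and 250-question high-disagreement subsets of both MedQA-USMLE and MedMCQA. Calibration improvement is the central finding: ECE is reduced by 49-74% across all four settings, including the harder MedMCQA benchmark, where these gains persist even when absolute accuracy is constrained by knowledge-intensive recall demands. On MedQA-250, the full system achieves ECE = 0.091 (a 74.4% reduction over the single-specialist baseline) and AUROC = 0.630 (+0.056) at 59.2% accuracy. Ablation analysis identifies Two-Phase Verification as the primary calibration driver and multi-agent reasoning as the primary accuracy driver. These results establish that consistency-based verification produces more reliable uncertainty estimates across diverse medical question types, providing a practical confidence signal for deferral in safety-critical clinical AI applications.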
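
Because the entry above reports its headline results as expected calibration error (ECE), a short sketch of the standard binned estimator may help when reading the numbers. It assumes equal-width confidence bins and computes ECE as the sum over bins of (bin size / n) times |bin accuracy - mean bin confidence|. The bin count and the toy inputs are illustrative assumptions; the abstract does not state how the paper bins its confidences.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard equal-width binned ECE.

    confidences: predicted confidence in [0, 1] for each answer.
    correct: 1 if the corresponding answer was right, else 0.
    n_bins: number of equal-width bins (10 is a common default, assumed here).
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy example: an overconfident model (95% confidence, 50% accuracy).
overconfident = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [1, 1, 0, 0])
# Toy example: 50% confidence with 50% accuracy is perfectly calibrated.
calibrated = expected_calibration_error([0.5, 0.5], [1, 0])
print(overconfident, calibrated)
```

A lower value means reported confidence tracks empirical accuracy more closely, which is what makes the score usable as a deferral signal.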

Integrating Causal Machine Learning into Clinical Decision Support Systems: Insights from Literature and Practice

2603.24448v1 by Domenique Zipperling, Lukas Schmidt, Benedikt Hahn, Niklas Kühl, Steven Kimbrough

Current clinical decision support systems (CDSSs) typically base their predictions on correlation, not causation. In recent years, causal machine learning (ML) has emerged as a promising way to improve decision-making with CDSSs by offering interpretable, treatment-specific reasoning. However, existing research often emphasizes model development rather than designing clinician-facing interfaces. To address this gap, we investigated how CDSSs based on causal ML should be designed to effectively support collaborative clinical decision-making. Using a design science research methodology, we conducted a structured literature review and interviewed experienced physicians. From these, we derived eight empirically grounded design requirements, developed seven design principles, and proposed nine practical design features. Our results establish guidance for designing CDSSs that deliver causal insights, integrate seamlessly into clinical workflows, and support trust, usability, and human-AI collaboration. We also reveal tensions around automation, responsibility, and regulation, highlighting the need for an adaptive certification process for ML-based medical products.

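Summary: Current clinical decision support systems (CDSSs) typically base their predictions on correlation rather than causation. In recent years, causal machine learning (ML) has emerged as a promising way to improve decision-making with CDSSs by offering interpretable, treatment-specific reasoning. However, existing research often emphasizes model development rather than the design of clinician-facing interfaces. To address this gap, we investigated how CDSSs based on causal ML should be designed to support collaborative clinical decision-making effectively. Using a design science research methodology, we conducted a structured literature review and interviewed experienced physicians. From these, we derived eight empirically grounded design requirements, developed seven design principles, and proposed nine practical design features. Our results provide guidance for designing CDSSs that deliver causal insights, integrate seamlessly into clinical workflows, and support trust, usability, and human-AI collaboration. We also reveal tensions around automation, responsibility, and regulation, highlighting the need for an adaptive certification process for ML-based medical products.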

Enes Causal Discovery

2603.24436v3 by Alexis Kafantaris

Enes The proposed architecture is a mixture of experts, which allows for the model entities, such as the causal relationships, to be further parameterized. More specifically, an attempt is made to exploit a neural net as implementing neurons poses a great challenge for this dataset. To explain, a simple and fast Pearson coefficient linear model usually achieves good scores. An aggressive baseline that requires a really good model to overcome that is. Moreover, there are major limitations when it comes to causal discovery of observational data. Unlike the sachs one did not use interventions but only prior knowledge; the most prohibiting limitation is that of the data which is addressed. Thereafter, the method and the model are described and after that the results are presented.

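Summary: The architecture proposed by Enes is a mixture-of-experts model, which allows model entities such as the causal relationships to be further parameterized. More specifically, the work attempts to exploit a neural network, since implementing neurons poses a great challenge for this dataset. To explain, a simple and fast linear model based on the Pearson coefficient usually achieves good scores, making it an aggressive baseline that only a very good model can overcome. In addition, there are major limitations in causal discovery from observational data. Unlike Sachs, this work uses no interventions and relies only on prior knowledge; the most restrictive limitation is that of the data being addressed. The method and the model are then described, followed by the results.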

From Untamed Black Box to Interpretable Pedagogical Orchestration: The Ensemble of Specialized LLMs Architecture for Adaptive Tutoring

2603.23990v1 by Nizam Kadir

Monolithic Large Language Models (LLMs) used in educational dialogue often behave as "black boxes," where pedagogical decisions are implicit and difficult to audit, frequently violating instructional constraints by providing answers too early. We introduce the Ensemble of Specialized LLMs (ES-LLMs) architecture that separates decision-making from wording. Pedagogical actions are selected by a deterministic rules-based orchestrator coordinating specialized agents covering tutoring, assessment, feedback, scaffolding, motivation, and ethics, guided by an interpretable Bayesian Knowledge Tracing (BKT) student model. An LLM renderer surface-realizes the chosen action in natural language. This design emphasizes reliability and controllability: constraints such as "attempt-before-hint" and hint caps are enforced as explicit rules, and the system logs per-turn agent traces and constraint checks. Validation of pedagogical quality via human expert reviewers (N=6) and a multi-LLM-as-Judge panel (six state-of-the-art models) showed that ES-LLMs were preferred in 91.7% and 79.2% of cases, respectively. The architecture significantly outperformed monolithic baselines across all seven dimensions, particularly in Scaffolding & Guidance, and Trust & Explainability. Furthermore, a Monte Carlo simulation (N=2,400) exposed a "Mastery Gain Paradox," where monolithic tutors inflated short-term performance through over-assistance. In contrast, ES-LLMs achieved 100% adherence to pedagogical constraints (e.g., attempt-before-hint) and a 3.3x increase in hint efficiency. Operationally, ES-LLMs reduced costs by 54% and latency by 22% by utilizing stateless prompts. We conclude that structural decoupling is essential for transforming stochastic models into trustworthy, verifiable and resource-efficient pedagogical agents.

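Summary: Monolithic Large Language Models (LLMs) used in educational dialogue often behave as "black boxes": their pedagogical decisions are implicit and hard to audit, and they frequently violate instructional constraints by giving answers too early. We introduce the Ensemble of Specialized LLMs (ES-LLMs) architecture, which separates decision-making from wording. Pedagogical actions are selected by a deterministic rules-based orchestrator that coordinates specialized agents covering tutoring, assessment, feedback, scaffolding, motivation, and ethics, guided by an interpretable Bayesian Knowledge Tracing (BKT) student model. An LLM renderer realizes the chosen action in natural language. This design emphasizes reliability and controllability: constraints such as "attempt-before-hint" and hint caps are enforced as explicit rules, and the system logs per-turn agent traces and constraint checks. Validation of pedagogical quality by human expert reviewers (N=6) and a multi-LLM-as-Judge panel (six state-of-the-art models) showed that ES-LLMs were preferred in 91.7% and 79.2% of cases, respectively. The architecture significantly outperformed monolithic baselines across all seven dimensions, particularly in Scaffolding & Guidance and in Trust & Explainability. Furthermore, a Monte Carlo simulation (N=2,400) exposed a "Mastery Gain Paradox," in which monolithic tutors inflated short-term performance through over-assistance. In contrast, ES-LLMs achieved 100% adherence to pedagogical constraints (e.g., attempt-before-hint) and a 3.3x increase in hint efficiency. Operationally, ES-LLMs reduced costs by 54% and latency by 22% by using stateless prompts. We conclude that structural decoupling is essential for turning stochastic models into trustworthy, verifiable, and resource-efficient pedagogical agents.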
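
The student model and rule enforcement described in the entry above can be sketched with the textbook Bayesian Knowledge Tracing update plus an explicit hint gate. This is only an illustration under stated assumptions: the parameter values, the `may_give_hint` helper, and its default hint cap of 2 are hypothetical and are not taken from the ES-LLMs system.

```python
from dataclasses import dataclass


@dataclass
class BKTParams:
    # All four values are illustrative assumptions, not fitted parameters.
    p_init: float = 0.2   # prior probability the skill is already mastered
    p_learn: float = 0.15  # probability of learning after each opportunity
    p_guess: float = 0.2   # probability of a correct answer without mastery
    p_slip: float = 0.1    # probability of an incorrect answer despite mastery


def bkt_update(p_mastery, correct, params):
    """One standard BKT step: Bayesian posterior, then the learning transition."""
    if correct:
        numerator = p_mastery * (1 - params.p_slip)
        denominator = numerator + (1 - p_mastery) * params.p_guess
    else:
        numerator = p_mastery * params.p_slip
        denominator = numerator + (1 - p_mastery) * (1 - params.p_guess)
    posterior = numerator / denominator
    return posterior + (1 - posterior) * params.p_learn


def may_give_hint(attempts_made, hints_given, hint_cap=2):
    """Hypothetical rule gate: attempt-before-hint plus a cap on hints."""
    return attempts_made >= 1 and hints_given < hint_cap


params = BKTParams()
after_correct = bkt_update(params.p_init, True, params)
after_incorrect = bkt_update(params.p_init, False, params)
print(after_correct, after_incorrect)
print(may_give_hint(0, 0), may_give_hint(1, 0), may_give_hint(3, 2))
```

Keeping the gate as a plain boolean function, separate from any text generation, is what makes a constraint such as attempt-before-hint checkable and loggable on every turn.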

Generative AI User Experience: Developing Human-AI Epistemic Partnership

2603.23863v1 by Xiaoming Zhai

Generative AI (GenAI) has rapidly entered education, yet its user experience is often explained through adoption-oriented constructs such as usefulness, ease of use, and engagement. We argue that these constructs are no longer sufficient because systems such as ChatGPT do not merely support learning tasks but also participate in knowledge construction. Existing theories cannot explain why GenAI frequently produces experiences characterized by negotiated authority, redistributed cognition, and accountability tension. To address this gap, this paper develops the Human-AI Epistemic Partnership Theory (HAEPT), explaining the GenAI user experience as a form of epistemic partnership that features a dynamic negotiation of three interlocking contracts: epistemic, agency, and accountability. We argue that findings on trust, over-reliance, academic integrity, teacher caution, and relational interaction about GenAI can be reinterpreted as tensions within these contracts rather than as isolated issues. Instead of holding a single, stable view of GenAI, users adjust how they relate to it over time through calibration cycles. These repeated interactions account for why trust and skepticism often coexist and for how partnership modes describe recurrent configurations of human-AI collaboration across tasks. To demonstrate the usefulness of HAEPT, we applied it to analyze the UX of collaborative learning with AI speakers and AI-facilitated scientific argumentation, illustrating different contract configurations.

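Summary: Generative AI (GenAI) has rapidly entered education, yet its user experience is usually explained through adoption-oriented constructs such as usefulness, ease of use, and engagement. We argue that these constructs are no longer sufficient, because systems such as ChatGPT not only support learning tasks but also take part in knowledge construction. Existing theories cannot explain why GenAI frequently produces experiences characterized by negotiated authority, redistributed cognition, and accountability tension. To address this gap, this paper develops the Human-AI Epistemic Partnership Theory (HAEPT), which explains the GenAI user experience as a form of epistemic partnership featuring a dynamic negotiation of three interlocking contracts: epistemic, agency, and accountability. We argue that findings on trust, over-reliance, academic integrity, teacher caution, and relational interaction with GenAI can be reinterpreted as tensions within these contracts rather than as isolated issues. Rather than holding a single, stable view of GenAI, users adjust how they relate to it over time through calibration cycles. These repeated interactions explain why trust and skepticism often coexist and how partnership modes describe recurrent configurations of human-AI collaboration across tasks. To demonstrate the usefulness of HAEPT, we apply it to analyze the user experience of collaborative learning with AI speakers and of AI-facilitated scientific argumentation, illustrating different contract configurations.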

Causal AI For AMS Circuit Design: Interpretable Parameter Effects Analysis

2603.24618v1 by Mohyeu Hussain, David Koblah, Reiner Dizon-Paradis, Domenic Forte

Analog-mixed-signal (AMS) circuits are highly non-linear and operate on continuous real-world signals, making them far more difficult to model with data-driven AI than digital blocks. To close the gap between structured design data (device dimensions, bias voltages, etc.) and real-world performance, we propose a causal-inference framework that first discovers a directed-acyclic graph (DAG) from SPICE simulation data and then quantifies parameter impact through Average Treatment Effect (ATE) estimation. The approach yields human-interpretable rankings of design knobs and explicit 'what-if' predictions, enabling designers to understand trade-offs in sizing and topology. We evaluate the pipeline on three operational-amplifier families (OTA, telescopic, and folded-cascode) implemented in TSMC 65nm and benchmark it against a baseline neural-network (NN) regressor. Across all circuits the causal model reproduces simulation-based ATEs with an average absolute error of less than 25%, whereas the neural network deviates by more than 80% and frequently predicts the wrong sign. These results demonstrate that causal AI provides both higher accuracy and explainability, paving the way for more efficient, trustworthy AMS design automation.

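Summary: Analog-mixed-signal (AMS) circuits are highly non-linear and operate on continuous real-world signals, which makes them far harder to model with data-driven AI than digital blocks. To close the gap between structured design data (device dimensions, bias voltages, etc.) and real-world performance, we propose a causal-inference framework that first discovers a directed acyclic graph (DAG) from SPICE simulation data and then quantifies parameter impact through Average Treatment Effect (ATE) estimation. The approach yields human-interpretable rankings of design knobs and explicit "what-if" predictions, enabling designers to understand trade-offs in sizing and topology. We evaluate the pipeline on three operational-amplifier families (OTA, telescopic, and folded-cascode) implemented in TSMC 65nm and benchmark it against a baseline neural-network (NN) regressor. Across all circuits, the causal model reproduces simulation-based ATEs with an average absolute error of less than 25%, whereas the neural network deviates by more than 80% and frequently predicts the wrong sign. These results show that causal AI provides both higher accuracy and explainability, paving the way for more efficient, trustworthy AMS design automation.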
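
To make the Average Treatment Effect (ATE) in the entry above concrete, the following is a minimal sketch of one common estimator: backdoor adjustment by stratifying on a discrete confounder. The abstract does not say which estimator the authors use, so this choice is an assumption, and the toy rows (a bias setting as confounder, a binary "width increased" treatment, and gain in dB as outcome) are invented for illustration rather than taken from SPICE data.

```python
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


def ate_backdoor(rows):
    """ATE of a binary treatment, adjusted for one discrete confounder.

    rows: iterable of (z, t, y) with confounder z, treatment t in {0, 1},
    and numeric outcome y. Computes the sum over z of
    P(z) * (E[y | t=1, z] - E[y | t=0, z]).
    """
    rows = list(rows)
    strata = defaultdict(lambda: {0: [], 1: []})
    for z, t, y in rows:
        strata[z][t].append(y)

    ate = 0.0
    for z, groups in strata.items():
        if not groups[0] or not groups[1]:
            raise ValueError(f"stratum {z!r} lacks treated or control samples")
        weight = (len(groups[0]) + len(groups[1])) / len(rows)
        ate += weight * (mean(groups[1]) - mean(groups[0]))
    return ate


# Hypothetical data: (bias setting, width increased?, gain in dB).
rows = [
    (0, 0, 40.0), (0, 0, 40.0), (0, 0, 40.0), (0, 1, 42.0),
    (1, 0, 50.0), (1, 1, 52.0), (1, 1, 52.0), (1, 1, 52.0),
]
adjusted = ate_backdoor(rows)
naive = mean([y for _, t, y in rows if t == 1]) - mean([y for _, t, y in rows if t == 0])
print(adjusted, naive)
```

In this toy data the treatment is applied more often at the high-bias setting, so the naive difference in means (7 dB) overstates the effect, while the adjusted estimate recovers the 2 dB gain built into each stratum. This gap between correlational and causal estimates is the kind of discrepancy the abstract attributes to the neural-network baseline.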

Learning What Can Be Picked: Active Reachability Estimation for Efficient Robotic Fruit Harvesting

2603.23679v1 by Nur Afsa Syeda, Mohamed Elmahallawy, Luis Fernando de la Torre, John Miller

Agriculture remains a cornerstone of global health and economic sustainability, yet labor-intensive tasks such as harvesting high-value crops continue to face growing workforce shortages. Robotic harvesting systems offer a promising solution; however, their deployment in unstructured orchard environments is constrained by inefficient perception-to-action pipelines. In particular, existing approaches often rely on exhaustive inverse kinematics or motion planning to determine whether a target fruit is reachable, leading to unnecessary computation and delayed decision-making. Our approach combines RGB-D perception with active learning to directly learn reachability as a binary decision problem. We then leverage active learning to selectively query the most informative samples for reachability labeling, significantly reducing annotation effort while maintaining high predictive accuracy. Extensive experiments demonstrate that the proposed framework achieves accurate reachability prediction with substantially fewer labeled samples, yielding approximately 6-8% higher accuracy than random sampling and enabling label-efficient adaptation to new orchard configurations. Among the evaluated strategies, entropy- and margin-based sampling outperform Query-by-Committee and standard uncertainty sampling in low-label regimes, while all strategies converge to comparable performance as the labeled set grows. These results highlight the effectiveness of active learning for task-level perception in agricultural robotics and position our approach as a scalable alternative to computation-heavy kinematic reachability analysis. Our code is available through https://github.com/wsu-cyber-security-lab-ai/active-learning.

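Summary: Agriculture remains a cornerstone of global health and economic sustainability, yet labor-intensive tasks such as harvesting high-value crops continue to face growing workforce shortages. Robotic harvesting systems offer a promising solution; however, their deployment in unstructured orchard environments is constrained by inefficient perception-to-action pipelines. In particular, existing approaches often rely on exhaustive inverse kinematics or motion planning to determine whether a target fruit is reachable, which leads to unnecessary computation and delayed decisions. Our approach combines RGB-D perception with active learning to learn reachability directly as a binary decision problem. We then use active learning to selectively query the most informative samples for reachability labeling, significantly reducing annotation effort while maintaining high predictive accuracy. Extensive experiments show that the proposed framework achieves accurate reachability prediction with substantially fewer labeled samples, with approximately 6-8% higher accuracy than random sampling, and adapts label-efficiently to new orchard configurations. Among the evaluated strategies, entropy-based and margin-based sampling outperform Query-by-Committee and standard uncertainty sampling in low-label regimes, while all strategies converge to comparable performance as the labeled set grows. These results highlight the effectiveness of active learning for task-level perception in agricultural robotics and position our approach as a scalable alternative to computation-heavy kinematic reachability analysis. Our code is available at https://github.com/wsu-cyber-security-lab-ai/active-learning.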
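
The entropy-based and margin-based query strategies named in the entry above can be sketched in a few lines. This is a generic illustration of the two acquisition scores, not the code from the linked repository; the function names and the toy probabilities for a reachable / not-reachable classifier are assumptions.

```python
import math


def entropy_score(probs):
    """Shannon entropy of a predicted class distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def margin_score(probs):
    """Gap between the two most likely classes (lower = more uncertain)."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def select_queries(pool_probs, k, strategy="entropy"):
    """Return indices of the k unlabeled samples to send for labeling."""
    indices = range(len(pool_probs))
    if strategy == "entropy":
        ranked = sorted(indices, key=lambda i: entropy_score(pool_probs[i]), reverse=True)
    elif strategy == "margin":
        ranked = sorted(indices, key=lambda i: margin_score(pool_probs[i]))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return ranked[:k]


# Hypothetical predicted [P(reachable), P(not reachable)] for four candidate fruits.
pool = [[0.95, 0.05], [0.55, 0.45], [0.70, 0.30], [0.51, 0.49]]
entropy_picks = select_queries(pool, k=2, strategy="entropy")
margin_picks = select_queries(pool, k=2, strategy="margin")
print(entropy_picks, margin_picks)
```

For a two-class problem both scores depend only on how far the predicted probability is from 0.5, so they rank samples identically, as the toy pool shows. Differences between strategies would therefore have to come from other factors, such as how the probabilities are produced.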

Evaluating a Multi-Agent Voice-Enabled Smart Speaker for Care Homes: A Safety-Focused Framework

2603.23625v1 by Zeinab Dehghani, Rameez Raja Kureshi, Koorosh Aslansefat, Faezeh Alsadat Abedi, Dhavalkumar Thakker, Lisa Greaves, Bhupesh Kumar Mishra, Baseer Ahmad, Tanaya Maslekar

Artificial intelligence (AI) is increasingly being explored in health and social care to reduce administrative workload and allow staff to spend more time on patient care. This paper evaluates a voice-enabled Care Home Smart Speaker designed to support everyday activities in residential care homes, including spoken access to resident records, reminders, and scheduling tasks. A safety-focused evaluation framework is presented that examines the system end-to-end, combining Whisper-based speech recognition with retrieval-augmented generation (RAG) approaches (hybrid, sparse, and dense). Using supervised care-home trials and controlled testing, we evaluated 330 spoken transcripts across 11 care categories, including 184 reminder-containing interactions. These evaluations focus on (i) correct identification of residents and care categories, (ii) reminder recognition and extraction, and (iii) end-to-end scheduling correctness under uncertainty (including safe deferral/clarification). Given the safety-critical nature of care homes, particular attention is also paid to reliability in noisy environments and across diverse accents, supported by confidence scoring, clarification prompts, and human-in-the-loop oversight. In the best-performing configuration (GPT-5.2), resident ID and care category matching reached 100% (95% CI: 98.86-100), while reminder recognition reached 89.09% (95% CI: 83.81-92.80) with zero missed reminders (100% recall) but some false positives. End-to-end scheduling via calendar integration achieved 84.65% exact reminder-count agreement (95% CI: 78.00-89.56), indicating remaining edge cases in converting informal spoken instructions into actionable events. The findings suggest that voice-enabled systems, when carefully evaluated and appropriately safeguarded, can support accurate documentation, effective task management, and trustworthy use of AI in care home settings.

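Summary: Artificial intelligence (AI) is increasingly being explored in health and social care to reduce administrative workload and let staff spend more time on patient care. This paper evaluates a voice-enabled Care Home Smart Speaker designed to support everyday activities in residential care homes, including spoken access to resident records, reminders, and scheduling tasks. It presents a safety-focused evaluation framework that examines the system end-to-end, combining Whisper-based speech recognition with retrieval-augmented generation (RAG) approaches (hybrid, sparse, and dense). Using supervised care-home trials and controlled testing, we evaluated 330 spoken transcripts across 11 care categories, including 184 reminder-containing interactions. These evaluations focus on (i) correct identification of residents and care categories, (ii) reminder recognition and extraction, and (iii) end-to-end scheduling correctness under uncertainty (including safe deferral or clarification). Given the safety-critical nature of care homes, particular attention is paid to reliability in noisy environments and across diverse accents, supported by confidence scoring, clarification prompts, and human-in-the-loop oversight. In the best-performing configuration (GPT-5.2), resident ID and care category matching reached 100% (95% CI: 98.86-100), while reminder recognition reached 89.09% (95% CI: 83.81-92.80) with no missed reminders (100% recall) but some false positives. End-to-end scheduling via calendar integration achieved 84.65% exact reminder-count agreement (95% CI: 78.00-89.56), indicating remaining edge cases in converting informal spoken instructions into actionable events. The findings suggest that voice-enabled systems, when carefully evaluated and appropriately safeguarded, can support accurate documentation, effective task management, and trustworthy use of AI in care home settings.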

AI Generalisation Gap In Comorbid Sleep Disorder Staging

2603.23582v2 by Saswata Bose, Suvadeep Maiti, Shivam Kumar Sharma, Mythirayee S, Tapabrata Chakraborti, Srijitesh Rajendran, Raju S. Bapi

Accurate sleep staging is essential for diagnosing OSA and hypopnea in stroke patients. Although PSG is reliable, it is costly, labor-intensive, and manually scored. While deep learning enables automated EEG-based sleep staging in healthy subjects, our analysis shows poor generalization to clinical populations with disrupted sleep. Using Grad-CAM interpretations, we systematically demonstrate this limitation. We introduce iSLEEPS, a newly clinically annotated ischemic stroke dataset (to be publicly released), and evaluate a SE-ResNet plus bidirectional LSTM model for single-channel EEG sleep staging. As expected, cross-domain performance between healthy and diseased subjects is poor. Attention visualizations, supported by clinical expert feedback, show the model focuses on physiologically uninformative EEG regions in patient data. Statistical and computational analyses further confirm significant sleep architecture differences between healthy and ischemic stroke cohorts, highlighting the need for subject-aware or disease-specific models with clinical validation before deployment. A summary of the paper and the code is available at https://himalayansaswatabose.github.io/iSLEEPS_Explainability.github.io/

A Learning Method with Gap-Aware Generation for Heterogeneous DAG Scheduling

2603.23249v1 by Ruisong Zhou, Haijun Zou, Li Zhou, Chumin Sun, Zaiwen Wen

Efficient scheduling of directed acyclic graphs (DAGs) in heterogeneous environments is challenging due to resource capacities and dependencies. In practice, the need for adaptability across environments with varying resource pools and task types, alongside rapid schedule generation, complicates these challenges. We propose WeCAN, an end-to-end reinforcement learning framework for heterogeneous DAG scheduling that addresses task-pool compatibility coefficients and generation-induced optimality gaps. It adopts a two-stage single-pass design: a single forward pass produces task-pool scores and global parameters, followed by a generation map that constructs schedules without repeated network calls. Its weighted cross-attention encoder models task-pool interactions gated by compatibility coefficients, and is size-agnostic to environment fluctuations. Moreover, widely used list-scheduling maps can incur generation-induced optimality gaps from restricted reachability. We introduce an order-space analysis that characterizes the reachable set of generation maps via feasible schedule orders, explains the mechanism behind generation-induced gaps, and yields sufficient conditions for gap elimination. Guided by these conditions, we design a skip-extended realization with an analytically parameterized decreasing skip rule, which enlarges the reachable order set while preserving single-pass efficiency. Experiments on computation graphs and real-world TPC-H DAGs demonstrate improved makespan over strong baselines, with inference time comparable to classical heuristics and faster than multi-round neural schedulers.
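
For orientation, the classical list-scheduling baseline that the abstract contrasts against can be sketched as below. This is a minimal illustration and not WeCAN itself: the function name `list_schedule`, the input layout, and the assumption that each pool runs one task at a time are all hypothetical choices made for the example.

```python
from collections import deque


def list_schedule(durations, deps, num_pools):
    """Greedy list scheduling of a DAG onto heterogeneous pools.

    durations[t][p] is the run time of task t on pool p, or None when the
    task is incompatible with that pool (an assumed encoding).
    deps[t] lists the tasks that must finish before t may start.
    Each pool is assumed to run one task at a time.
    """
    tasks = list(durations)
    indegree = {t: len(deps.get(t, [])) for t in tasks}
    children = {t: [] for t in tasks}
    for t in tasks:
        for d in deps.get(t, []):
            children[d].append(t)

    ready = deque(t for t in tasks if indegree[t] == 0)
    pool_free = [0.0] * num_pools   # time at which each pool becomes idle
    finish, placement = {}, {}

    while ready:
        t = ready.popleft()
        earliest = max((finish[d] for d in deps.get(t, [])), default=0.0)
        best = None
        for p in range(num_pools):
            cost = durations[t][p]
            if cost is None:
                continue  # incompatible pool
            end = max(earliest, pool_free[p]) + cost
            if best is None or end < best[0]:
                best = (end, p)
        if best is None:
            raise ValueError(f"task {t!r} has no compatible pool")
        finish[t], placement[t] = best
        pool_free[best[1]] = best[0]
        for c in children[t]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)

    if len(finish) != len(tasks):
        raise ValueError("dependency graph contains a cycle")
    return max(finish.values(), default=0.0), placement


makespan, placement = list_schedule(
    {"a": [2, 4], "b": [3, 1], "c": [None, 2]},
    {"c": ["a", "b"]},
    num_pools=2,
)
```

Because this greedy map always commits the next ready task to its earliest-finishing pool, some feasible schedule orders can never be produced, which is one way to picture the restricted reachability the abstract mentions.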

Why AI-Generated Text Detection Fails: Evidence from Explainable AI Beyond Benchmark Accuracy

2603.23146v1 by Shushanta Pudasaini, Luis Miralles-Pechuán, David Lillis, Marisa Llorens Salvador

The widespread adoption of Large Language Models (LLMs) has made the detection of AI-Generated text a pressing and complex challenge. Although many detection systems report high benchmark accuracy, their reliability in real-world settings remains uncertain, and their interpretability is often unexplored. In this work, we investigate whether contemporary detectors genuinely identify machine authorship or merely exploit dataset-specific artefacts. We propose an interpretable detection framework that integrates linguistic feature engineering, machine learning, and explainable AI techniques. When evaluated on two prominent benchmark corpora, namely PAN CLEF 2025 and COLING 2025, our model trained on 30 linguistic features achieves leaderboard-competitive performance, attaining an F1 score of 0.9734. However, systematic cross-domain and cross-generator evaluation reveals substantial generalisation failure: classifiers that excel in-domain degrade significantly under distribution shift. Using SHAP-based explanations, we show that the most influential features differ markedly between datasets, indicating that detectors often rely on dataset-specific stylistic cues rather than stable signals of machine authorship. Further investigation with in-depth error analysis exposes a fundamental tension in linguistic-feature-based AI text detection: the features that are most discriminative on in-domain data are also the features most susceptible to domain shift, formatting variation, and text-length effects. We believe that this knowledge helps build AI detectors that are robust across different settings. To support replication and practical use, we release an open-source Python package that returns both predictions and instance-level explanations for individual texts.

HUydra: Full-Range Lung CT Synthesis via Multiple HU Interval Generative Modelling

2603.23041v1 by António Cardoso, Pedro Sousa, Tania Pereira, Hélder P. Oliveira

Currently, a central challenge and bottleneck in the deployment and validation of computer-aided diagnosis (CAD) models within the field of medical imaging is data scarcity. For lung cancer, one of the most prevalent types worldwide, limited datasets can delay diagnosis and affect patient outcomes. Generative AI offers a promising solution for this issue, but dealing with the complex distribution of full Hounsfield Unit (HU) range lung CT scans is challenging and remains highly computationally demanding. This paper introduces a novel decomposition strategy that synthesizes CT images one HU interval at a time, rather than modelling the entire HU domain at once. This framework focuses on training generative architectures on individual tissue-focused HU windows, then merges their output into a full-range scan via a learned reconstruction network that effectively reverses the HU-windowing process. We further propose multi-head and multi-decoder models to better capture textures while preserving anatomical consistency, with a multi-head VQVAE achieving the best performance for the generative task. Quantitative evaluation shows this approach significantly outperforms conventional 2D full-range baselines, achieving a 6.2% improvement in FID and superior MMD, Precision, and Recall across all HU intervals. The best performance is achieved by a multi-head VQVAE variant, demonstrating that it is possible to enhance visual fidelity and variability while also reducing model complexity and computational cost. This work establishes a new paradigm for structure-aware medical image synthesis, aligning generative modelling with clinical interpretation.
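
The HU-windowing step that the abstract's reconstruction network is said to reverse can be illustrated with a small sketch. The helper names and the three interval bounds below are illustrative assumptions, not values taken from the paper.

```python
def hu_window(pixels, low, high):
    """Clip HU values to [low, high] and rescale them to [0, 1]."""
    if high <= low:
        raise ValueError("window upper bound must exceed lower bound")
    span = high - low
    return [(min(max(v, low), high) - low) / span for v in pixels]


def split_into_windows(pixels, intervals):
    """Produce one windowed copy of the scan per named HU interval."""
    return {name: hu_window(pixels, lo, hi) for name, (lo, hi) in intervals.items()}


# Hypothetical tissue-focused intervals, chosen only for illustration.
EXAMPLE_INTERVALS = {
    "lung": (-1000, -400),
    "soft_tissue": (-160, 240),
    "bone": (300, 1300),
}

windows = split_into_windows([-1200, -700, -400, 100], EXAMPLE_INTERVALS)
```

Clipping discards everything outside each interval, so no single window can be inverted on its own; that is presumably why the abstract describes merging the per-window outputs with a learned network rather than a closed-form inverse.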

Concept-based explanations of Segmentation and Detection models in Natural Disaster Management

2603.23020v1 by Samar Heydari, Jawher Said, Galip Ümit Yolcu, Evgenii Kortukov, Elena Golimblevskaia, Evgenios Vlachos, Vasileios Mygdalis, Ioannis Pitas, Sebastian Lapuschkin, Leila Arras

Deep learning models for flood and wildfire segmentation and object detection enable precise, real-time disaster localization when deployed on embedded drone platforms. However, in natural disaster management, the lack of transparency in their decision-making process hinders human trust required for emergency response. To address this, we present an explainability framework for understanding flood segmentation and car detection predictions on the widely used PIDNet and YOLO architectures. More specifically, we introduce a novel redistribution strategy that extends Layer-wise Relevance Propagation (LRP) explanations for sigmoid-gated element-wise fusion layers. This extension allows LRP relevances to flow through the fusion modules of PIDNet, covering the entire computation graph back to the input image. Furthermore, we apply Prototypical Concept-based Explanations (PCX) to provide both local and global explanations at the concept level, revealing which learned features drive the segmentation and detection of specific disaster semantic classes. Experiments on a publicly available flood dataset show that our framework provides reliable and interpretable explanations while maintaining near real-time inference capabilities, rendering it suitable for deployment on resource-constrained platforms, such as Unmanned Aerial Vehicles (UAVs).

On the use of Aggregation Operators to improve Human Identification using Dental Records

2603.23003v1 by Antonio D. Villegas-Yeguas, Guillermo R-García, Tzipi Kahana, Jorge Pinares Toledo, Esi Sharon, Oscar Ibañez, Oscar Cordón

The comparison of dental records is a standardized technique in forensic dentistry used to speed up the identification of individuals in multiple-comparison scenarios. Specifically, the odontogram comparison is a procedure to compute criteria that will be used to perform a ranking. State-of-the-art automatic methods either make use of simple techniques, without utilizing the full potential of the information obtained from a comparison, or their internal behavior is not known due to the lack of peer-reviewed publications. This work aims to design aggregation mechanisms to automatically compare pairs of dental records that can be understood and validated by experts, improving the current methods. To do so, we introduce different aggregation approaches using the state-of-the-art codification, based on seven different criteria. In particular, we study the performance of i) data-driven lexicographical order-based aggregations, ii) well-known fuzzy logic aggregation methods and iii) machine learning techniques as aggregation mechanisms. To validate our proposals, 215 forensic cases from two different populations have been used. The results obtained show that the use of white-box machine learning techniques as aggregation models (average ranking from 2.02 to 2.21) is able to improve on the state of the art (average ranking of 3.91) without compromising the explainability and interpretability of the method.
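
As a rough picture of approach (i), a lexicographical aggregation ranks candidates by the highest-priority criterion first and uses later criteria only to break ties. The sketch below assumes higher scores are better and uses two made-up criterion names; the paper's seven criteria and its data-driven priority order are not reproduced here.

```python
def lexicographic_rank(candidates, priority):
    """Rank candidate matches by criteria taken in priority order.

    candidates maps a candidate name to a dict of criterion -> score
    (higher is assumed to be better). priority lists the criteria from
    most to least important. Ties on every criterion fall back to name
    order so the result is deterministic.
    """
    def sort_key(name):
        scores = candidates[name]
        return tuple(-scores[c] for c in priority) + (name,)

    return sorted(candidates, key=sort_key)


# Hypothetical criteria and scores, for illustration only.
ranking = lexicographic_rank(
    {
        "A": {"matches": 5, "compatible": 1},
        "B": {"matches": 5, "compatible": 3},
        "C": {"matches": 4, "compatible": 9},
    },
    priority=["matches", "compatible"],
)
```

Here `C` ranks last despite its high second score, because the second criterion is consulted only when the first one ties. That transparency is what makes such an ordering easy for an expert to audit.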

Ran Score: a LLM-based Evaluation Score for Radiology Report Generation

2603.22935v1 by Ran Zhang, Yucong Lin, Zhaoli Su, Bowen Liu, Danni Ai, Tianyu Fu, Deqiang Xiao, Jingfan Fan, Yuanyuan Wang, Mingwei Gao, Yuwan Hu, Shuya Gao, Jingtao Li, Jian Yang, Hong Song, Hongliang Sun

Chest X-ray report generation and automated evaluation are limited by poor recognition of low-prevalence abnormalities and inadequate handling of clinically important language, including negation and ambiguity. We develop a clinician-guided framework combining human expertise and large language models for multi-label finding extraction from free-text chest X-ray reports and use it to define Ran Score, a finding-level metric for report evaluation. Using three non-overlapping MIMIC-CXR-EN cohorts from a public chest X-ray dataset and an independent ChestX-CN validation cohort, we optimize prompts, establish radiologist-derived reference labels and evaluate report generation models. The optimized framework improves the macro-averaged score from 0.753 to 0.956 on the MIMIC-CXR-EN development cohort, exceeds the CheXbert benchmark by 15.7 percentage points on directly comparable labels, and shows robust generalization on the ChestX-CN validation cohort. Here we show that clinician-guided prompt optimization improves agreement with a radiologist-derived reference standard and that Ran Score enables finding-level evaluation of report fidelity, particularly for low-prevalence abnormalities.
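
The abstract reports a macro-averaged score without naming the underlying metric. Assuming it is a per-finding F1 averaged with equal weight across findings, the computation could look like the sketch below; the function name and the example labels are hypothetical.

```python
def macro_f1(gold, predicted, labels):
    """Average per-label F1 over labels, giving each label equal weight.

    gold and predicted are parallel lists of label sets, one per report.
    A label with no true or predicted positives contributes 0.0 here,
    which is one common convention and an assumption of this sketch.
    """
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, predicted) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, predicted) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, predicted) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


score = macro_f1(
    gold=[{"effusion"}, {"effusion", "nodule"}, set()],
    predicted=[{"effusion"}, {"effusion"}, {"nodule"}],
    labels=["effusion", "nodule"],
)
```

In this toy case the common finding is extracted perfectly but the rare one is always wrong, so the macro average drops to 0.5. Equal weighting per finding is what makes such a score sensitive to low-prevalence abnormalities.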

Medical

Publish Date Title Authors Homepage Code
2026-04-02 Do Emotions in Prompts Matter? Effects of Emotional Framing on Large Language Models Minda Zhao et.al. 2604.02236v1 null
2026-04-02 When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning Juarez Monteiro et.al. 2604.02226v1 null
2026-04-02 Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study Yosuke Yamagishi et.al. 2604.02207v1 null
2026-04-02 Rare-Aware Autoencoding: Reconstructing Spatially Imbalanced Data Alejandro Castañeda Garcia et.al. 2604.02031v1 null
2026-04-02 Abnormal Head Movements in Neurological Conditions: A Knowledge-Based Dataset with Application to Cervical Dystonia Saja Al-Dabet et.al. 2604.01962v1 null
2026-04-02 Bayesian Elicitation with LLMs: Model Size Helps, Extra "Reasoning" Doesn't Always Luka Hobor et.al. 2604.01896v1 null
2026-04-02 Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints Minh-Khoi Pham et.al. 2604.01841v1 null
2026-04-02 A deep learning pipeline for PAM50 subtype classification using histopathology images and multi-objective patch selection Arezoo Borji et.al. 2604.01798v1 null
2026-04-02 Transformer self-attention encoder-decoder with multimodal deep learning for response time series forecasting and digital twin support in wind structural health monitoring Feiyu Zhou et.al. 2604.01712v1 null
2026-04-02 Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy Ruijie Yang et.al. 2604.01705v1 null
2026-04-02 Scale over Preference: The Impact of AI-Generated Content on Online Content Ecology Tianhao Shi et.al. 2604.01690v1 null
2026-04-02 Ontology-Aware Design Patterns for Clinical AI Systems: Translating Reification Theory into Software Architecture Florian Odi Stummer et.al. 2604.01661v1 null
2026-04-02 CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery Ao Qu et.al. 2604.01658v1 null
2026-04-02 NEMESIS: Noise-suppressed Efficient MAE with Enhanced Superpatch Integration Strategy Kyeonghun Kim et.al. 2604.01612v1 null
2026-04-02 Does Your Optimizer Care How You Normalize? Normalization-Optimizer Coupling in LLM Training Abdelrahman Abouzeid et.al. 2604.01563v1 null
2026-04-02 Countering Catastrophic Forgetting of Large Language Models for Better Instruction Following via Weight-Space Model Merging Mengxian Lyu et.al. 2604.01538v1 null
2026-04-02 PHMForge: A Scenario-Driven Agentic Benchmark for Industrial Asset Lifecycle Maintenance Ayan Das et.al. 2604.01532v1 null
2026-04-02 A Role-Based LLM Framework for Structured Information Extraction from Healthy Food Policies Congjing Zhang et.al. 2604.01529v1 null
2026-04-01 DISCO-TAB: A Hierarchical Reinforcement Learning Framework for Privacy-Preserving Synthesis of Complex Clinical Data Arshia Ilaty et.al. 2604.01481v1 null
2026-04-01 Low-Burden LLM-Based Preference Learning: Personalizing Assistive Robots from Natural Language Feedback for Users with Paralysis Keshav Shankar et.al. 2604.01463v1 null
2026-04-01 When AI Gets it Wong: Reliability and Risk in AI-Assisted Medication Decision Systems Khalid Adnan Alsayed et.al. 2604.01449v1 null
2026-04-01 AffordTissue: Dense Affordance Prediction for Tool-Action Specific Tissue Interaction Aiza Maksutova et.al. 2604.01371v1 null
2026-04-01 Safety, Security, and Cognitive Risks in World Models Manoj Parmar et.al. 2604.01346v1 null
2026-04-01 Regularizing Attention Scores with Bootstrapping Neo Christopher Chung et.al. 2604.01339v1 null
2026-04-01 AdaLoRA-QAT: Adaptive Low-Rank and Quantization-Aware Segmentation Prantik Deb et.al. 2604.01167v1 null
2026-04-01 Brainstacks: Cross-Domain Cognitive Capabilities via Frozen MoE-LoRA Stacks for Continual LLM Learning Mohammad R. Abu Ayyash et.al. 2604.01152v1 null
2026-04-01 PsychAgent: An Experience-Driven Lifelong Learning Agent for Self-Evolving Psychological Counselor Yutao Yang et.al. 2604.00931v2 null
2026-04-01 OkanNet: A Lightweight Deep Learning Architecture for Classification of Brain Tumor from MRI Images Okan Uçar et.al. 2604.01264v1 null
2026-04-01 BioCOMPASS: Integrating Biomarkers into Transformer-Based Immunotherapy Response Prediction Sayed Hashim et.al. 2604.00739v1 null
2026-04-01 Toward Optimal Sampling Rate Selection and Unbiased Classification for Precise Animal Activity Recognition Axiu Mao et.al. 2604.00517v1 null
2026-04-01 MAESIL: Masked Autoencoder for Enhanced Self-supervised Medical Image Learning Kyeonghun Kim et.al. 2604.00514v1 null
2026-04-01 A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation Yabin Zhang et.al. 2604.00493v1 null
2026-04-01 Improving Generalization of Deep Learning for Brain Metastases Segmentation Across Institutions Yuchen Yang et.al. 2604.00397v1 null
2026-04-01 EvolveTool-Bench: Evaluating the Quality of LLM-Generated Tool Libraries as Software Artifacts Alibek T. Kaliyev et.al. 2604.00392v1 null
2026-03-31 Collaborative AI Agents and Critics for Fault Detection and Cause Analysis in Network Telemetry Syed Eqbal Alam et.al. 2604.00319v1 null
2026-03-31 SANA I2I: A Text Free Flow Matching Framework for Paired Image to Image Translation with a Case Study in Fetal MRI Artifact Reduction Italo Felix Santos et.al. 2604.00298v1 null
2026-03-31 A Safety-Aware Role-Orchestrated Multi-Agent LLM Framework for Behavioral Health Communication Simulation Ha Na Cho et.al. 2604.00249v1 null
2026-03-31 One Panel Does Not Fit All: Case-Adaptive Multi-Agent Deliberation for Clinical Prediction Yuxing Lu et.al. 2604.00085v1 null
2026-03-31 Physiological and Semantic Patterns in Medical Teams Using an Intelligent Tutoring System Xiaoshan Huang et.al. 2603.29950v1 null
2026-03-31 Four Generations of Quantum Biomedical Sensors Xin Jin et.al. 2603.29944v2 null
2026-03-31 ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules Jonas Landsgesell et.al. 2603.29928v1 null
2026-03-31 Training deep learning based dynamic MR image reconstruction using synthetic fractals Anirudh Raman et.al. 2603.29922v1 null
2026-03-31 Brain MR Image Synthesis with Multi-contrast Self-attention GAN Zaid A. Abod et.al. 2604.00070v1 null
2026-03-31 Symphony for Medical Coding: A Next-Generation Agentic System for Scalable and Explainable Medical Coding Joakim Edin et.al. 2603.29709v1 null
2026-03-31 Few-shot Writer Adaptation via Multimodal In-Context Learning Tom Simon et.al. 2603.29450v1 null
2026-03-31 NeoNet: An End-to-End 3D MRI-Based Deep Learning Framework for Non-Invasive Prediction of Perineural Invasion via Generation-Driven Classification Youngung Han et.al. 2603.29449v1 null
2026-03-31 AI-Generated Prior Authorization Letters: Strong Clinical Content, Weak Administrative Scaffolding Moiz Sadiq Awan et.al. 2603.29366v1 null
2026-03-31 Predicting Neuromodulation Outcome for Parkinson's Disease with Generative Virtual Brain Model Siyuan Du et.al. 2603.29176v1 null
2026-03-31 Knowledge database development by large language models for countermeasures against viruses and marine toxins Hung N. Do et.al. 2603.29149v1 null
2026-03-31 Towards Explainable Stakeholder-Aware Requirements Prioritisation in Aged-Care Digital Health Yuqing Xiao et.al. 2603.29114v1 null
2026-03-30 A Latent Risk-Aware Machine Learning Approach for Predicting Operational Success in Clinical Trials based on TrialsBank Iness Halimi et.al. 2603.29041v1 null
2026-03-30 Human-Like Lifelong Memory: A Neuroscience-Grounded Architecture for Infinite Interaction Diego C. Lerma-Torres et.al. 2603.29023v1 null
2026-03-30 Towards a Medical AI Scientist Hongtao Wu et.al. 2603.28589v1 null
2026-03-30 Detecting low left ventricular ejection fraction from ECG using an interpretable and scalable predictor-driven framework Ya Zhou et.al. 2603.28532v1 null
2026-03-30 FeDMRA: Federated Incremental Learning with Dynamic Memory Replay Allocation Tiantian Wang et.al. 2603.28455v1 null
2026-03-30 The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation Doan Nam Long Vu et.al. 2603.28387v1 null
2026-03-30 Bit-Identical Medical Deep Learning via Structured Orthogonal Initialization Yakov Pyotr Shkolnikov et.al. 2603.28040v1 null
2026-03-29 Towards Emotion Recognition with 3D Pointclouds Obtained from Facial Expression Images Laura Rayón Ropero et.al. 2603.27798v1 null
2026-03-29 RAP: Retrieve, Adapt, and Prompt-Fit for Training-Free Few-Shot Medical Image Segmentation Zhihao Mao et.al. 2603.27705v1 null
2026-03-29 Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development Zhongying Deng et.al. 2603.27460v1 null
2026-03-28 Improving Automated Wound Assessment Using Joint Boundary Segmentation and Multi-Class Classification Models Mehedi Hasan Tusar et.al. 2603.27325v1 null
2026-03-28 Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection Jinhu Fu et.al. 2603.27240v1 null
2026-03-28 MediHive: A Decentralized Agent Collective for Medical Reasoning Xiaoyang Wang et.al. 2603.27150v1 null
2026-03-28 Bayes-MICE: A Bayesian Approach to Multiple Imputation for Time Series Data Amuche Ibenegbu et.al. 2603.27142v1 null
2026-03-28 Autonomous Agent-Orchestrated Digital Twins (AADT): Leveraging the OpenClaw Framework for State Synchronization in Rare Genetic Disorders Hongzhuo Chen et.al. 2603.27104v1 null
2026-03-27 When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models Juan Gabriel Kostelec et.al. 2603.26556v1 null
2026-03-27 Foundation Model for Cardiac Time Series via Masked Latent Attention Moritz Vandenhirtz et.al. 2603.26475v1 null
2026-03-27 PRISMA: Toward a Normative Information Infrastructure for Responsible Pharmaceutical Knowledge Management Eugenio Rodrigo Zimmer Neves et.al. 2603.26324v1 null
2026-03-27 Progressive Learning with Anatomical Priors for Reliable Left Atrial Scar Segmentation from Late Gadolinium Enhancement MRI Jing Zhang et.al. 2603.26186v1 null
2026-03-27 SkinGPT-X: A Self-Evolving Collaborative Multi-Agent System for Transparent and Trustworthy Dermatological Diagnosis Zhangtianyi Chen et.al. 2603.26122v1 null
2026-03-27 Seeing Like Radiologists: Context- and Gaze-Guided Vision-Language Pretraining for Chest X-rays Kang Liu et.al. 2603.26049v1 null
2026-03-27 Unlabeled Cross-Center Automatic Analysis for TAAD: An Integrated Framework from Segmentation to Clinical Features Mengdi Liu et.al. 2603.26019v1 null
2026-03-27 FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants Mahesh Bhosale et.al. 2603.26008v1 null
2026-03-27 Longitudinal Boundary Sharpness Coefficient Slopes Predict Time to Alzheimer's Disease Conversion in Mild Cognitive Impairment: A Survival Analysis Using the ADNI Cohort Ishaan Cherukuri et.al. 2603.26007v1 null
2026-03-26 When Chain-of-Thought Backfires: Evaluating Prompt Sensitivity in Medical Language Models Binesh Sadanandan et.al. 2603.25960v1 null
2026-03-26 Methods for Knowledge Graph Construction from Text Collections: Development and Applications Vanni Zavarella et.al. 2603.25862v1 null
2026-03-26 Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI Anna Kozlova et.al. 2603.25821v1 null
2026-03-26 Beyond identifiability: Learning causal representations with few environments and finite samples Inbeom Lee et.al. 2603.25796v1 null
2026-03-26 DeepFAN, a transformer-based deep learning model for human-artificial intelligence collaborative assessment of incidental pulmonary nodules in CT scans: a multi-reader, multi-case trial Zhenchen Zhu et.al. 2603.25607v1 null
2026-03-26 Interpretable PM2.5 Forecasting for Urban Air Quality: A Comparative Study of Operational Time-Series Models Moazzam Umer Gondal et.al. 2603.25495v1 null
2026-03-26 Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models Eyal Hadad et.al. 2603.25403v2 null
2026-03-26 A Causal Framework for Evaluating ICU Discharge Strategies Sagar Nagaraj Simha et.al. 2603.25397v1 null
2026-03-26 Evaluating Language Models for Harmful Manipulation Canfer Akbulut et.al. 2603.25326v2 null
2026-03-26 AD-CARE: A Guideline-grounded, Modality-agnostic LLM Agent for Real-world Alzheimer's Disease Diagnosis with Multi-cohort Assessment, Fairness Analysis, and Reader Study Wenlong Hou et.al. 2603.25322v1 null
2026-03-26 A Gait Foundation Model Predicts Multi-System Health Phenotypes from 3D Skeletal Motion Adam Gabet et.al. 2603.25283v1 null
2026-03-26 A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations Andong Tan et.al. 2603.25196v1 null
2026-03-26 Empowering Epidemic Response: The Role of Reinforcement Learning in Infectious Disease Control Mutong Liu et.al. 2603.25771v1 null
2026-03-26 Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models Chengyu Fang et.al. 2603.25155v1 null
2026-03-26 Factors Influencing the Quality of AI-Generated Code: A Synthesis of Empirical Evidence Vehid Geruslu et.al. 2603.25146v1 null
2026-03-26 Rethinking Health Agents: From Siloed AI to Collaborative Decision Mediators Ray-Yuan Chung et.al. 2603.24986v1 null
2026-03-26 Subject-Specific Low-Field MRI Synthesis via a Neural Operator Ziqi Gao et.al. 2603.24968v1 null
2026-03-26 Sovereign AI at the Front Door of Care: A Physically Unidirectional Architecture for Secure Clinical Intelligence Vasu Srinivasan et.al. 2603.24898v1 null
2026-03-25 More Than "Means to an End": Supporting Reasoning with Transparently Designed AI Data Science Processes Venkatesh Sivaraman et.al. 2603.24877v1 null
2026-03-25 Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models Isha Puri et.al. 2603.24844v1 null
2026-03-25 A Practical Guide Towards Interpreting Time-Series Deep Clinical Predictive Models: A Reproducibility Study Yongda Fan et.al. 2603.24828v1 null
2026-03-25 HASS: Hierarchical Simulation of Logopenic Aphasic Speech for Scalable PPA Detection Harrison Li et.al. 2603.26795v1 null
2026-03-25 PhyDCM: A Reproducible Open-Source Framework for AI-Assisted Brain Tumor Classification from Multi-Sequence MRI Hayder Saad Abdulbaqi et.al. 2603.26794v1 null
2026-03-25 Dissecting Model Failures in Abdominal Aortic Aneurysm Segmentation through Explainability-Driven Analysis Abu Noman Md Sakib et.al. 2603.24801v1 null
2026-03-25 Evaluating Fine-Tuned LLM Model For Medical Transcription With Small Low-Resource Languages Validated Dataset Mohammed Nowshad Ruhani Chowdhury et.al. 2603.24772v1 null
2026-03-25 Pseudo Label NCF for Sparse OHC Recommendation: Dual Representation Learning and the Separability Accuracy Trade off Pronob Kumar Barman et.al. 2603.24750v2 null

Abstracts

Do Emotions in Prompts Matter? Effects of Emotional Framing on Large Language Models

2604.02236v1 by Minda Zhao, Yutong Yang, Chufei Peng, Rachel Gonsalves, Weiyue Li, Ruyi Yang, Zhixi Liu, Mengyu Wang

Emotional tone is pervasive in human communication, yet its influence on large language model (LLM) behaviour remains unclear. Here, we examine how first-person emotional framing in user-side queries affects LLM performance across six benchmark domains, including mathematical reasoning, medical question answering, reading comprehension, commonsense reasoning and social inference. Across models and tasks, static emotional prefixes usually produce only small changes in accuracy, suggesting that affective phrasing is typically a mild perturbation rather than a reliable general-purpose intervention. This stability is not uniform: effects are more variable in socially grounded tasks, where emotional context more plausibly interacts with interpersonal reasoning. Additional analyses show that stronger emotional wording induces only modest extra change, and that human-written prefixes reproduce the same qualitative pattern as LLM-generated ones. We then introduce EmotionRL, an adaptive emotional prompting framework that selects emotional framing adaptively for each query. Although no single emotion is consistently beneficial, adaptive selection yields more reliable gains than fixed emotional prompting. Together, these findings show that emotional tone is neither a dominant driver of LLM performance nor irrelevant noise, but a weak and input-dependent signal that can be exploited through adaptive control.

When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning

2604.02226v1 by Juarez Monteiro, Nathan Gavenski, Gianlucca Zuin, Adriano Veloso

Reinforcement learning (RL) agents often struggle with out-of-distribution (OOD) scenarios, leading to high uncertainty and random behavior. While language models (LMs) contain valuable world knowledge, larger ones incur high computational costs, hindering real-time use, and exhibit limitations in autonomous planning. We introduce Adaptive Safety through Knowledge (ASK), which combines smaller LMs with trained RL policies to enhance OOD generalization without retraining. ASK employs Monte Carlo Dropout to assess uncertainty and queries the LM for action suggestions only when uncertainty exceeds a set threshold. This selective use preserves the efficiency of existing policies while leveraging the language model's reasoning in uncertain situations. In experiments on the FrozenLake environment, ASK shows no improvement in-domain, but demonstrates robust navigation in transfer tasks, achieving a reward of 0.95. Our findings indicate that effective neuro-symbolic integration requires careful orchestration rather than simple combination, highlighting the need for sufficient model scale and effective hybridization mechanisms for successful OOD generalization.
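
The gating rule described above (estimate uncertainty from repeated stochastic forward passes, and consult the language model only above a threshold) can be sketched as follows. The abstract does not say which uncertainty statistic ASK uses, so the entropy of the averaged action distribution is an assumption here, and `stochastic_policy` and `ask_lm` are hypothetical callables standing in for a dropout-enabled policy and an LM query.

```python
import math
import random


def predictive_entropy(prob_samples):
    """Entropy of the action distribution averaged over stochastic passes."""
    n = len(prob_samples)
    mean = [sum(column) / n for column in zip(*prob_samples)]
    entropy = -sum(p * math.log(p) for p in mean if p > 0)
    return entropy, mean


def gated_action(state, stochastic_policy, ask_lm, threshold, passes=20, rng=None):
    """Act with the policy unless its uncertainty exceeds the threshold.

    stochastic_policy(state, rng) must return action probabilities from a
    single forward pass with dropout left active (hypothetical interface).
    ask_lm(state) returns the action suggested by the language model.
    """
    rng = rng or random.Random(0)
    samples = [stochastic_policy(state, rng) for _ in range(passes)]
    entropy, mean = predictive_entropy(samples)
    if entropy > threshold:
        return ask_lm(state), entropy, "lm"
    action = max(range(len(mean)), key=mean.__getitem__)
    return action, entropy, "policy"
```

Keeping the language model behind the gate means its cost is paid only on uncertain states, which matches the efficiency argument made in the abstract.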

Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study

2604.02207v1 by Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka, Takeharu Yoshikawa, Osamu Abe

Background: Accurate translation of radiology reports is important for multilingual research, clinical communication, and radiology education, but the validity of LLM-based evaluation remains unclear. Objective: To evaluate the educational suitability of LLM-generated Japanese translations of chest CT reports and compare radiologist assessments with LLM-as-a-judge evaluations. Methods: We analyzed 150 chest CT reports from the CT-RATE-JPN validation set. For each English report, a human-edited Japanese translation was compared with an LLM-generated translation by DeepSeek-V3.2. A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity. In parallel, 3 LLM judges (DeepSeek-V3.2, Mistral Large 3, and GPT-5) evaluated the same pairs. Agreement was assessed using QWK and percentage agreement. Results: Agreement between radiologists and LLM judges was near zero (QWK=-0.04 to 0.15). Agreement between the 2 radiologists was also poor (QWK=0.01 to 0.06). Radiologist 1 rated terminology as equivalent in 59% of cases and favored the LLM translation for readability (51%) and overall quality (51%). Radiologist 2 rated readability as equivalent in 75% of cases and favored the human-edited translation for overall quality (40% vs 21%). All 3 LLM judges strongly favored the LLM translation across all criteria (70%-99%) and rated it as more radiologist-like in >93% of cases. Conclusions: LLM-generated translations were often judged natural and fluent, but the 2 radiologists differed substantially. LLM-as-a-judge showed strong preference for LLM output and negligible agreement with radiologists. For educational use of translated radiology reports, automated LLM-based evaluation alone is insufficient; expert radiologist review remains important.

摘要:背景:準確翻譯放射學報告對於多語言研究、臨床溝通和放射學教育至關重要,但基於大型語言模型(LLM)的評估有效性仍不清楚。目標:評估LLM生成的胸部CT報告日文翻譯的教育適用性,並比較放射科醫生的評估與LLM作為評判者的評估。方法:我們分析了來自CT-RATE-JPN驗證集的150份胸部CT報告。對於每份英文報告,將人工編輯的日文翻譯與DeepSeek-V3.2生成的翻譯進行比較。一名經過認證的放射科醫生和一名放射科住院醫師獨立進行了盲評,根據四個標準進行配對評估:術語準確性、可讀性、整體質量和放射科醫生風格的真實性。同時,三名LLM評判者(DeepSeek-V3.2、Mistral Large 3和GPT-5)對相同的配對進行評估。使用QWK和百分比一致性評估協議。結果:放射科醫生與LLM評判者之間的協議接近於零(QWK=-0.04至0.15)。兩名放射科醫生之間的協議也很差(QWK=0.01至0.06)。放射科醫生1在59%的案例中將術語評為等同,並偏好LLM翻譯的可讀性(51%)和整體質量(51%)。放射科醫生2在75%的案例中將可讀性評為等同,並偏好人工編輯的翻譯在整體質量上(40%對21%)。所有三名LLM評判者在所有標準上都強烈偏好LLM翻譯(70%-99%),並在超過93%的案例中將其評為更像放射科醫生的翻譯。結論:LLM生成的翻譯通常被評為自然流暢,但兩名放射科醫生的評價存在顯著差異。LLM作為評判者對LLM輸出表現出強烈偏好,與放射科醫生的協議微不足道。對於翻譯放射學報告的教育使用,僅依賴自動化的LLM基於評估是不夠的;專家放射科醫生的審查仍然很重要。
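
For readers unfamiliar with the agreement statistic reported above, quadratic weighted kappa (QWK) can be computed as in the sketch below. The three-level coding (0 = prefers the human-edited translation, 1 = equivalent, 2 = prefers the LLM translation) and the sample ratings are assumptions for illustration; the abstract does not give the exact rating scale.

```python
def quadratic_weighted_kappa(rater_a, rater_b, n_categories):
    """Cohen's kappa with quadratic disagreement weights.
    Ratings are integers in 0..n_categories-1."""
    k, n = n_categories, len(rater_a)
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row[i] * col[j] / n
    if weighted_expected == 0.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - weighted_observed / weighted_expected

def percent_agreement(rater_a, rater_b):
    """Share of items on which both raters chose the same category."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

# Hypothetical ratings: 0 = prefers human, 1 = equivalent, 2 = prefers LLM
radiologist = [1, 1, 2, 0, 1, 2]
llm_judge = [2, 2, 2, 2, 2, 1]
print(quadratic_weighted_kappa(radiologist, llm_judge, 3))
print(percent_agreement(radiologist, llm_judge))
```

A value near 1 means the raters agree far more than chance would predict, a value near 0 means chance-level agreement, and negative values mean systematic disagreement.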

Rare-Aware Autoencoding: Reconstructing Spatially Imbalanced Data

2604.02031v1 by Alejandro Castañeda Garcia, Jan van Gemert, Daan Brinks, Nergis Tömen

Autoencoders can be challenged by spatially non-uniform sampling of image content. This is common in medical imaging, biology, and physics, where informative patterns occur rarely at specific image coordinates, as background dominates these locations in most samples, biasing reconstructions toward the majority appearance. In practice, autoencoders are biased toward dominant patterns, resulting in the loss of fine-grained detail and causing blurred reconstructions for rare spatial inputs, especially under spatial data imbalance. We address spatial imbalance with two complementary components: (i) a self-entropy-based loss that upweights statistically uncommon spatial locations and (ii) Sample Propagation, a replay mechanism that selectively re-exposes the model to hard-to-reconstruct samples across batches during training. We benchmark existing data balancing strategies, originally developed for supervised classification, in the unsupervised reconstruction setting. Drawing on the limitations of these approaches, our method specifically targets spatial imbalance by encouraging models to focus on statistically rare locations, improving reconstruction consistency compared to existing baselines. We validate on a simulated dataset with controlled spatial imbalance conditions, and on three uncontrolled, diverse real-world datasets spanning physical, biological, and astronomical domains. Our approach outperforms baselines on various reconstruction metrics, particularly under spatially imbalanced distributions. These results highlight the importance of data representation in a batch and of emphasizing rare samples in unsupervised image reconstruction. We will make all code and related data available.

摘要:自編碼器在圖像內容的空間不均勻取樣方面面臨挑戰。這在醫學影像、生物學和物理學中很常見,因為在特定的圖像坐標上,資訊性模式很少出現,背景在大多數樣本中主導這些位置,導致重建偏向於主要外觀。實際上,自編碼器對主導模式存在偏見,導致細節損失,並在稀有空間輸入下造成模糊的重建,特別是在空間數據不平衡的情況下。我們通過兩個互補的組件來解決空間不平衡問題:(i) 基於自熵的損失,對統計上不常見的空間位置進行加權,以及 (ii) 樣本傳播,一種重播機制,在訓練過程中選擇性地重新暴露模型於難以重建的樣本。 我們在無監督重建環境中基準測試了原本為監督分類開發的現有數據平衡策略。基於這些方法的局限性,我們的方法專門針對空間不平衡,鼓勵模型專注於統計上稀有的位置,與現有基準相比,提高重建的一致性。我們在具有控制空間不平衡條件的模擬數據集以及三個不受控的多樣化真實世界數據集中進行驗證,這些數據集涵蓋物理、生物和天文領域。我們的方法在各種重建指標上超越了基準,特別是在空間不平衡分佈下。這些結果突顯了批次中數據表示的重要性,並強調了無監督圖像重建中稀有樣本的價值。我們將提供所有代碼和相關數據。

Abnormal Head Movements in Neurological Conditions: A Knowledge-Based Dataset with Application to Cervical Dystonia

2604.01962v1 by Saja Al-Dabet, Sherzod Turaev, Nazar Zaki

Abnormal head movements (AHMs) manifest across a broad spectrum of neurological disorders; however, the absence of a multi-condition resource integrating kinematic measurements, clinical severity scores, and patient demographics constitutes a persistent barrier to the development of AI-driven diagnostic tools. To address this gap, this study introduces NeuroPose-AHM, a knowledge-based dataset of neurologically induced AHMs constructed through a multi-LLM extraction framework applied to 1,430 peer-reviewed publications. The dataset contains 2,756 patient-group-level records spanning 57 neurological conditions, derived from 846 AHM-relevant papers. Inter-LLM reliability analysis confirms robust extraction performance, with study-level classification achieving strong agreement (kappa = 0.822). To demonstrate the dataset's analytical utility, a four-task framework is applied to cervical dystonia (CD), the condition most directly defined by pathological head movement. First, Task 1 performs multi-label AHM type classification (F1 = 0.856). Task 2 constructs the Head-Neck Severity Index (HNSI), a unified metric that normalizes heterogeneous clinical rating scales. The clinical relevance of this index is then evaluated in Task 3, where HNSI is validated against real-world CD patient data, with aligned severe-band proportions (6.7%) providing a preliminary plausibility indication for index calibration within the high severity range. Finally, Task 4 performs bridge analysis between movement-type probabilities and HNSI scores, producing significant correlations (p less than 0.001). These results demonstrate the analytical utility of NeuroPose-AHM as a structured, knowledge-based resource for neurological AHM research. The NeuroPose-AHM dataset is publicly available on Zenodo (https://doi.org/10.5281/zenodo.19386862).

摘要:異常頭部運動(AHMs)在廣泛的神經疾病中表現出來;然而,缺乏一個整合運動學測量、臨床嚴重程度評分和患者人口統計的多條件資源,構成了開發基於人工智慧的診斷工具的持續障礙。為了解決這一問題,本研究介紹了NeuroPose-AHM,這是一個基於知識的神經誘發AHMs數據集,通過應用於1,430篇經過同行評審的出版物的多LLM提取框架構建而成。該數據集包含2,756個患者群體級別的記錄,涵蓋57種神經疾病,來源於846篇與AHM相關的論文。跨LLM可靠性分析確認了穩健的提取性能,研究級別的分類達到強一致性(kappa = 0.822)。為了展示該數據集的分析效用,將四任務框架應用於頸部肌張力障礙(CD),這是由病理性頭部運動最直接定義的疾病。首先,任務1執行多標籤AHM類型分類(F1 = 0.856)。任務2構建頭頸嚴重程度指數(HNSI),這是一個統一的指標,將異質的臨床評分標準進行標準化。然後在任務3中評估該指數的臨床相關性,其中HNSI與現實世界的CD患者數據進行驗證,對應的重度比例(6.7%)為指數在高嚴重程度範圍內的校準提供了初步的合理性指示。最後,任務4在運動類型概率和HNSI分數之間進行橋接分析,產生了顯著的相關性(p小於0.001)。這些結果展示了NeuroPose-AHM作為一個結構化的、基於知識的神經AHM研究資源的分析效用。NeuroPose-AHM數據集在Zenodo上公開可用(https://doi.org/10.5281/zenodo.19386862)。

Bayesian Elicitation with LLMs: Model Size Helps, Extra "Reasoning" Doesn't Always

2604.01896v1 by Luka Hobor, Mario Brcic, Mihael Kovac, Kristijan Poje

Large language models (LLMs) have been proposed as alternatives to human experts for estimating unknown quantities with associated uncertainty, a process known as Bayesian elicitation. We test this by asking eleven LLMs to estimate population statistics, such as health prevalence rates, personality trait distributions, and labor market figures, and to express their uncertainty as 95% credible intervals. We vary each model's reasoning effort (low, medium, high) to test whether more "thinking" improves results. Our findings reveal three key results. First, larger, more capable models produce more accurate estimates, but increasing reasoning effort provides no consistent benefit. Second, all models are severely overconfident: their 95% intervals contain the true value only 9-44% of the time, far below the expected 95%. Third, a statistical recalibration technique called conformal prediction can correct this overconfidence, expanding the intervals to achieve the intended coverage. In a preliminary experiment, giving models web search access degraded predictions for already-accurate models, while modestly improving predictions for weaker ones. Models performed well on commonly discussed topics but struggled with specialized health data. These results indicate that LLM uncertainty estimates require statistical correction before they can be used in decision-making.

摘要:大型語言模型(LLMs)被提出作為人類專家在估計與不確定性相關的未知數量的替代方案,這個過程被稱為貝葉斯引導。我們通過要求十一個LLM估計人口統計數據,例如健康流行率、個性特徵分佈和勞動市場數據,並將其不確定性表達為95%可信區間,來測試這一點。我們變化每個模型的推理努力(低、中、高)以測試更多的“思考”是否能改善結果。我們的研究結果揭示了三個關鍵結果。首先,較大、能力更強的模型產生更準確的估計,但增加推理努力並未提供一致的好處。其次,所有模型都過於自信:它們的95%區間僅在9-44%的情況下包含真實值,遠低於預期的95%。第三,一種稱為符合預測的統計重新校準技術可以糾正這種過度自信,擴大區間以實現預期的覆蓋率。在一個初步實驗中,給模型提供網絡搜索訪問權限使得已經準確的模型的預測變差,而對較弱的模型則有適度的改善。模型在常見話題上表現良好,但在專門的健康數據上則掙扎。這些結果表明,LLM的不確定性估計在用於決策之前需要進行統計校正。
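
The recalibration step mentioned above can be illustrated with split conformal prediction applied to intervals. The sketch below widens each interval by a margin learned on a calibration set (the conformalized-interval variant); whether the paper uses exactly this variant is not stated in the abstract, and the synthetic data generator is purely hypothetical.

```python
import math

def conformal_margin(intervals, truths, alpha=0.05):
    """Margin to add on both sides so that new intervals cover the truth
    with probability >= 1 - alpha (assuming exchangeable data)."""
    scores = sorted(max(lo - y, y - hi) for (lo, hi), y in zip(intervals, truths))
    rank = math.ceil((len(scores) + 1) * (1 - alpha))
    return math.inf if rank > len(scores) else scores[rank - 1]

def coverage(intervals, truths, margin=0.0):
    """Fraction of truths inside their (optionally widened) interval."""
    hits = sum(lo - margin <= y <= hi + margin for (lo, hi), y in zip(intervals, truths))
    return hits / len(truths)

def make_data(indices):
    """Synthetic overconfident estimates: errors up to 2.0 in magnitude,
    but every stated interval is only +/- 0.5 wide."""
    truths = [float(i) for i in indices]
    errors = [((i * 7) % 11 - 5) * 0.4 for i in indices]
    intervals = [(y + e - 0.5, y + e + 0.5) for y, e in zip(truths, errors)]
    return intervals, truths

cal_intervals, cal_truths = make_data(range(40))
new_intervals, new_truths = make_data(range(40, 80))

margin = conformal_margin(cal_intervals, cal_truths)
print("raw coverage:", coverage(new_intervals, new_truths))
print("margin:", margin)
print("recalibrated coverage:", coverage(new_intervals, new_truths, margin))
```

The method needs no change to the model itself: it only requires a held-out set of questions with known answers, which is why it suits post hoc correction of overconfident intervals.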

Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints

2604.01841v1 by Minh-Khoi Pham, Thang-Long Nguyen Ho, Thao Thi Phuong Dao, Tai Tan Mai, Minh-Triet Tran, Marie E. Ward, Una Geary, Rob Brennan, Nick McDonald, Martin Crane, Marija Bezbradica

Clinical prediction from structured electronic health records (EHRs) is challenging due to high dimensionality, heterogeneity, class imbalance, and distribution shift. While tabular in-context learning (TICL) and retrieval-augmented methods perform well on generic benchmarks, their behavior in clinical settings remains unclear. We present a multi-cohort EHR benchmark comparing classical, deep tabular, and TICL models across varying data scale, feature dimensionality, outcome rarity, and cross-cohort generalization. PFN-based TICL models are sample-efficient in low-data regimes but degrade under naive distance-based retrieval as heterogeneity and imbalance increase. We propose AWARE, a task-aligned retrieval framework using supervised embedding learning and lightweight adapters. AWARE improves AUPRC by up to 12.2% under extreme imbalance, with gains increasing with data complexity. Our results identify retrieval quality and retrieval-inference alignment as key bottlenecks for deploying tabular in-context learning in clinical prediction.

摘要:臨床預測來自結構化電子健康紀錄(EHRs)是具有挑戰性的,因為它們具有高維度性、異質性、類別不平衡和分佈轉移。雖然表格內文學習(TICL)和檢索增強方法在通用基準上表現良好,但它們在臨床環境中的行為仍不明確。我們提出了一個多隊列EHR基準,比較了傳統模型、深度表格模型和TICL模型在不同數據規模、特徵維度、結果稀有性和跨隊列泛化方面的表現。基於PFN的TICL模型在低數據環境中樣本效率高,但在異質性和不平衡性增加時,簡單的基於距離的檢索會導致性能下降。我們提出了AWARE,一個與任務對齊的檢索框架,使用監督式嵌入學習和輕量級適配器。在極端不平衡的情況下,AWARE將AUPRC提高了多達12.2%,並且隨著數據複雜性的增加而增長。我們的結果確定了檢索質量和檢索推理對齊是將表格內文學習應用於臨床預測的關鍵瓶頸。

A deep learning pipeline for PAM50 subtype classification using histopathology images and multi-objective patch selection

2604.01798v1 by Arezoo Borji, Gernot Kronreif, Bernhard Angermayr, Francisco Mario Calisto, Wolfgang Birkfellner, Inna Servetnyk, Yinyin Yuan, Sepideh Hatamikia

Breast cancer is a highly heterogeneous disease with diverse molecular profiles. The PAM50 gene signature is widely recognized as a standard for classifying breast cancer into intrinsic subtypes, enabling more personalized treatment strategies. In this study, we introduce a novel optimization-driven deep learning framework that aims to reduce reliance on costly molecular assays by directly predicting PAM50 subtypes from H&E-stained whole-slide images (WSIs). Our method jointly optimizes patch informativeness, spatial diversity, uncertainty, and patch count by combining the non-dominated sorting genetic algorithm II (NSGA-II) with Monte Carlo dropout-based uncertainty estimation. The proposed method can identify a small but highly informative patch subset for classification. We used a ResNet18 backbone for feature extraction and a custom CNN head for classification. For evaluation, we used the internal TCGA-BRCA dataset as the training cohort and the external CPTAC-BRCA dataset as the test cohort. On the internal dataset, an F1-score of 0.8812 and an AUC of 0.9841 using 627 WSIs from the TCGA-BRCA cohort were achieved. The performance of the proposed approach on the external validation dataset showed an F1-score of 0.7952 and an AUC of 0.9512. These findings indicate that the proposed optimization-guided, uncertainty-aware patch selection can achieve high performance and improve the computational efficiency of histopathology-based PAM50 classification compared to existing methods, suggesting a scalable imaging-based replacement that has the potential to support clinical decision-making.

摘要:乳腺癌是一種高度異質性的疾病,具有多樣的分子特徵。PAM50基因特徵被廣泛認可為將乳腺癌分類為內在亞型的標準,從而使得更個性化的治療策略成為可能。在本研究中,我們介紹了一種新穎的優化驅動深度學習框架,旨在通過直接從H&E染色的全切片圖像(WSIs)預測PAM50亞型,以減少對昂貴分子檢測的依賴。我們的方法通過將非支配排序遺傳算法II(NSGA-II)與基於蒙特卡羅丟棄的不確定性估計相結合,聯合優化補丁信息量、空間多樣性、不確定性和補丁數量。所提出的方法可以識別出一小部分但高度信息豐富的補丁子集進行分類。我們使用ResNet18作為特徵提取的骨幹,並使用自定義CNN頭進行分類。為了評估,我們使用內部的TCGA-BRCA數據集作為訓練隊列,並使用外部的CPTAC-BRCA數據集作為測試隊列。在內部數據集上,使用來自TCGA-BRCA隊列的627個WSIs達到了0.8812的F1分數和0.9841的AUC。所提出方法在外部驗證數據集上的表現顯示F1分數為0.7952,AUC為0.9512。這些發現表明,所提出的優化引導、不確定性感知的補丁選擇能夠實現高性能,並提高基於組織病理學的PAM50分類的計算效率,相較於現有方法,這表明了一種可擴展的基於影像的替代方案,具有支持臨床決策的潛力。
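
The multi-objective selection described above rests on Pareto dominance, the comparison at the core of NSGA-II's non-dominated sorting. The sketch below shows only that first step (extracting the non-dominated front), not the full genetic algorithm. The objective directions (maximize informativeness and diversity, minimize uncertainty and patch count) and the candidate values are assumptions for illustration.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly
    better on at least one. All objectives are minimized."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q is not p)]

def as_minimization(informativeness, diversity, uncertainty, patch_count):
    """Negate the objectives we want to maximize (assumed directions)."""
    return (-informativeness, -diversity, uncertainty, patch_count)

# Hypothetical candidate patch subsets
subset_a = as_minimization(0.90, 0.60, 0.20, 50)
subset_b = as_minimization(0.85, 0.70, 0.25, 40)
subset_c = as_minimization(0.80, 0.55, 0.30, 60)  # worse than subset_a everywhere

print(pareto_front([subset_a, subset_b, subset_c]))
```

Neither `subset_a` nor `subset_b` dominates the other, so both survive; a full NSGA-II run would then rank the remaining fronts and use crowding distance to keep the population diverse.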

Transformer self-attention encoder-decoder with multimodal deep learning for response time series forecasting and digital twin support in wind structural health monitoring

2604.01712v1 by Feiyu Zhou, Marios Impraimakis

The wind-induced structural response forecasting capabilities of a novel transformer methodology are examined here. The model also provides a digital twin component for bridge structural health monitoring. Firstly, the approach uses the temporal characteristics of the system to train a forecasting model. Secondly, the vibration predictions are compared to the measured ones to detect large deviations. Finally, the identified cases are used as an early-warning indicator of structural change. The artificial intelligence-based model outperforms approaches for response forecasting as no assumption on wind stationarity or on structural normal vibration behavior is needed. Specifically, wind-excited dynamic behavior suffers from uncertainty related to obtaining poor predictions when the environmental or traffic conditions change. This results in a hard distinction of what constitutes normal vibration behavior. To this end, a framework is rigorously examined on real-world measurements from the Hardanger Bridge monitored by the Norwegian University of Science and Technology. The approach captures accurate structural behavior in realistic conditions, and with respect to the changes in the system excitation. The results, importantly, highlight the potential of transformer-based digital twin components to serve as next-generation tools for resilient infrastructure management, continuous learning, and adaptive monitoring over the system's lifecycle with respect to temporal characteristics.

摘要:風引起的結構反應預測能力在這裡檢驗了一種新型Transformer方法。該模型還提供了一個數字雙胞胎組件,用於橋樑結構健康監測。首先,該方法利用系統的時間特徵來訓練預測模型。其次,將振動預測與實測數據進行比較,以檢測大偏差。最後,識別出的案例用作結構變化的早期預警指標。基於人工智慧的模型在反應預測方面表現優於其他方法,因為不需要對風的穩定性或結構的正常振動行為做出假設。具體而言,風激發的動態行為受到不確定性的影響,當環境或交通條件改變時,會導致預測不佳。這使得正常振動行為的界定變得困難。為此,該框架在挪威科技大學監測的哈爾丹格橋的實際測量數據上進行了嚴格檢驗。該方法在現實條件下捕捉到準確的結構行為,並考慮到系統激勵的變化。結果重要地突顯了基於Transformer的數字雙胞胎組件作為下一代工具的潛力,用於彈性基礎設施管理、持續學習和在系統生命周期內根據時間特徵進行自適應監測。

Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy

2604.01705v1 by Ruijie Yang, Yan Zhu, Peiyao Fu, Te Luo, Zhihua Wang, Xian Yang, Quanlin Li, Pinghong Zhou, Shuo Wang

Automatic speech recognition (ASR) is a critical interface for human-AI interaction in gastrointestinal endoscopy, yet its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic conditions. Here, we present EndoASR, a domain-adapted ASR system designed for real-time deployment in endoscopic workflows. We develop a two-stage adaptation strategy based on synthetic endoscopy reports, targeting domain-specific language modeling and noise robustness. In retrospective evaluation across six endoscopists, EndoASR substantially improves both transcription accuracy and clinical usability, reducing character error rate (CER) from 20.52% to 14.14% and increasing medical term accuracy (Med ACC) from 54.30% to 87.59%. In a prospective multi-center study spanning five independent endoscopy centers, EndoASR demonstrates consistent generalization under heterogeneous real-world conditions. Compared with the baseline Paraformer model, CER is reduced from 16.20% to 14.97%, while Med ACC is improved from 61.63% to 84.16%, confirming its robustness in practical deployment scenarios. Notably, EndoASR achieves a real-time factor (RTF) of 0.005, significantly faster than Whisper-large-v3 (RTF 0.055), while maintaining a compact model size of 220M parameters, enabling efficient edge deployment. Furthermore, integration with large language models demonstrates that improved ASR quality directly enhances downstream structured information extraction and clinician-AI interaction. These results demonstrate that domain-adapted ASR can serve as a reliable interface for human-AI teaming in gastrointestinal endoscopy, with consistent performance validated across multi-center real-world clinical settings.

摘要:自動語音識別(ASR)是人機互動中一個關鍵介面,尤其是在胃腸內視鏡檢查中,但其在現實臨床環境中的可靠性受到特定領域術語和複雜聲學條件的限制。在此,我們介紹EndoASR,一個為內視鏡工作流程實時部署而設計的領域適應ASR系統。我們基於合成內視鏡報告開發了一種兩階段適應策略,針對特定領域的語言建模和噪音穩健性。在對六位內視鏡醫生的回顧性評估中,EndoASR顯著提高了轉錄準確性和臨床可用性,將字符錯誤率(CER)從20.52%降低至14.14%,並將醫療術語準確性(Med ACC)從54.30%提高至87.59%。在一項跨越五個獨立內視鏡中心的前瞻性多中心研究中,EndoASR在異質的現實條件下顯示出一致的泛化能力。與基線Paraformer模型相比,CER從16.20%降低至14.97%,而Med ACC從61.63%提高至84.16%,確認了其在實際部署情境中的穩健性。值得注意的是,EndoASR實現了0.005的實時因子(RTF),顯著快於Whisper-large-v3(RTF 0.055),同時保持220M參數的緊湊模型大小,實現高效的邊緣部署。此外,與大型語言模型的整合顯示,改善的ASR質量直接增強了下游結構化信息提取和臨床醫生與AI的互動。這些結果表明,領域適應的ASR可以作為胃腸內視鏡中人機協作的可靠介面,其一致的性能在多中心現實臨床環境中得到了驗證。
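
The two headline metrics above, character error rate (CER) and real-time factor (RTF), follow standard definitions and can be computed as in this sketch. The example strings and timings are made up for illustration; only the RTF values 0.005 and 0.055 come from the abstract.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance over characters (insert, delete, substitute)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (ref_char != hyp_char),   # substitution or match
            ))
        previous = current
    return previous[-1]

def character_error_rate(reference, hypothesis):
    """Edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)

def real_time_factor(processing_seconds, audio_seconds):
    """Processing time per second of audio; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds

print(character_error_rate("kitten", "sitting"))
print(real_time_factor(processing_seconds=0.05, audio_seconds=10.0))
print(0.055 / 0.005)  # ratio of the two RTF values reported in the abstract
```

By this definition, an RTF of 0.005 means ten seconds of audio is transcribed in about 0.05 seconds, roughly eleven times less processing time than an RTF of 0.055.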

Scale over Preference: The Impact of AI-Generated Content on Online Content Ecology

2604.01690v1 by Tianhao Shi, Yang Zhang, Xiaoyan Zhao, Fengbin Zhu, Chenyi Lei, Han Li, Wenwu Ou, Yang Song, Yongdong Zhang, Fuli Feng

The rapid proliferation of Artificial Intelligence-Generated Content (AIGC) is fundamentally restructuring online content ecologies, necessitating a rigorous examination of its behavioral and distributional implications. Leveraging a comprehensive longitudinal dataset comprising tens of millions of users from a leading Chinese video-sharing platform, this study elucidated the distinct creation and consumption behaviors characterizing AIGC versus Human-Generated Content (HGC). We identified a prevalent scale-over-preference dynamic, wherein AIGC creators achieve aggregate engagement comparable to HGC creators through high-volume production, despite a marked consumer preference for HGC. Deeper analysis uncovered the ability of the algorithmic content distribution mechanism in moderating these competing interests regarding AIGC. These findings advocated for the implementation of AIGC-sensitive distribution algorithms and precise governance frameworks to ensure the long-term health of the online content platforms.

摘要:人工智慧生成內容(AIGC)的快速增長正在根本重塑線上內容生態系統,迫切需要對其行為和分配影響進行嚴格的檢視。本研究利用來自一家領先中國視頻分享平台的數千萬用戶的綜合縱向數據集,闡明了AIGC與人類生成內容(HGC)之間的獨特創作和消費行為。我們識別出一種普遍的規模優於偏好的動態,即AIGC創作者通過高產量的生產實現了與HGC創作者相當的總體參與度,儘管消費者對HGC的偏好顯著。更深入的分析揭示了算法內容分發機制在調節這些關於AIGC的競爭利益方面的能力。這些發現提倡實施對AIGC敏感的分發算法和精確的治理框架,以確保線上內容平台的長期健康。

Ontology-Aware Design Patterns for Clinical AI Systems: Translating Reification Theory into Software Architecture

2604.01661v1 by Florian Odi Stummer

Clinical AI systems routinely train on health data structurally distorted by documentation workflows, billing incentives, and terminology fragmentation. Prior work has characterised the mechanisms of this distortion: the three-forces model of documentary enactment, the reification feedback loop through which AI may amplify coding artefacts, and terminology governance failures that allow semantic drift to accumulate. Yet translating these insights into implementable software architecture remains an open problem. This paper proposes seven ontology-aware design patterns in Gang-of-Four pattern language for building clinical AI pipelines resilient to ontological distortion. The patterns address data ingestion validation (Ontological Checkpoint), low-frequency signal preservation (Dormancy-Aware Pipeline), continuous drift monitoring (Drift Sentinel), parallel representation maintenance (Dual-Ontology Layer), feedback loop interruption (Reification Circuit Breaker), terminology evolution management (Terminology Version Gate), and pluggable regulatory compliance (Regulatory Compliance Adapter). Each pattern is specified with Problem, Forces, Solution, Consequences, Known Uses, and Related Patterns. We illustrate their composition in a reference architecture for a primary care AI system and provide a walkthrough tracing all seven patterns through a diabetes risk prediction scenario. This paper does not report empirical validation; it offers a design vocabulary grounded in theoretical analysis, subject to future evaluation in production systems. Three patterns have partial precedent in existing systems; the remaining four have not been formally described. Limitations include the absence of runtime benchmarks and restriction to the German and EU regulatory context.

摘要:臨床 AI 系統通常在受到文檔工作流程、計費激勵和術語碎片化結構性扭曲的健康數據上進行訓練。先前的研究已經描述了這種扭曲的機制:文檔執行的三力模型、AI 可能放大編碼工件的具象反饋循環,以及允許語義漂移累積的術語治理失敗。然而,將這些洞見轉化為可實施的軟體架構仍然是一個未解決的問題。本文提出了七種基於本體的設計模式,使用四人幫模式語言來構建對本體扭曲具有韌性的臨床 AI 管道。這些模式針對數據攝取驗證(本體檢查點)、低頻信號保留(休眠感知管道)、持續漂移監控(漂移哨兵)、平行表示維護(雙本體層)、反饋循環中斷(具象電路斷路器)、術語演變管理(術語版本閘)和可插拔的監管合規性(監管合規適配器)。每個模式都包含問題、力量、解決方案、後果、已知用途和相關模式的規範。我們在一個初級護理 AI 系統的參考架構中展示了它們的組合,並提供了一個通過糖尿病風險預測場景追蹤所有七種模式的步驟。本文不報告實證驗證;它提供了一個基於理論分析的設計詞彙,待未來在生產系統中進行評估。三種模式在現有系統中有部分先例;其餘四種尚未被正式描述。限制包括缺乏運行時基準和僅限於德國及歐盟的監管背景。

CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

2604.01658v1 by Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, Jiacheng Zhu, Xuan Jiang, Sirui Li, Cathy Wu, Bryan Kian Hsiang Low, Jinhua Zhao, Paul Pu Liang

Large language model (LLM)-based evolution is a promising approach for open-ended discovery, where progress requires sustained search and knowledge accumulation. Existing methods still rely heavily on fixed heuristics and hard-coded exploration rules, which limit the autonomy of LLM agents. We present CORAL, the first framework for autonomous multi-agent evolution on open-ended problems. CORAL replaces rigid control with long-running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions. It also provides practical safeguards, including isolated workspaces, evaluator separation, resource management, and agent session and health management. Evaluated on diverse mathematical, algorithmic, and systems optimization tasks, CORAL sets new state-of-the-art results on 10 tasks, achieving 3-10 times higher improvement rates with far fewer evaluations than fixed evolutionary search baselines across tasks. On Anthropic's kernel engineering task, four co-evolving agents improve the best known score from 1363 to 1103 cycles. Mechanistic analyses further show how these gains arise from knowledge reuse and multi-agent exploration and communication. Together, these results suggest that greater agent autonomy and multi-agent evolution can substantially improve open-ended discovery. Code is available at https://github.com/Human-Agent-Society/CORAL.

摘要:大型語言模型(LLM)基礎的演化是一種有前景的開放式發現方法,其中進展需要持續的搜索和知識積累。現有的方法仍然在很大程度上依賴於固定的啟發式和硬編碼的探索規則,這限制了LLM代理的自主性。我們提出了CORAL,這是第一個針對開放式問題的自主多代理演化框架。CORAL用長期運行的代理取代了僵化的控制,這些代理通過共享的持久記憶、異步多代理執行和基於心跳的干預進行探索、反思和合作。它還提供了實用的安全措施,包括隔離的工作空間、評估者分離、資源管理以及代理會話和健康管理。在多樣的數學、算法和系統優化任務上進行評估,CORAL在10個任務上設置了新的最先進結果,實現了3-10倍的更高改進率,並且在任務中所需的評估次數遠少於固定的演化搜索基準。在Anthropic的核心工程任務上,四個共同演化的代理將最佳已知分數從1363改善到1103個循環。機械分析進一步顯示這些增益是如何來自知識重用和多代理的探索與通信。總體而言,這些結果表明,更大的代理自主性和多代理演化可以顯著改善開放式發現。代碼可在 https://github.com/Human-Agent-Society/CORAL 獲得。

NEMESIS: Noise-suppressed Efficient MAE with Enhanced Superpatch Integration Strategy

2604.01612v1 by Kyeonghun Kim, Hyeonseok Jung, Youngung Han, Hyunsu Go, Eunseob Choi, Seongbin Park, Junsu Lim, Jiwon Yang, Sumin Lee, Insung Hwang, Ken Ying-Kai Liao, Nam-Joon Kim

Volumetric CT imaging is essential for clinical diagnosis, yet annotating 3D volumes is expensive and time-consuming, motivating self-supervised learning (SSL) from unlabeled data. However, applying SSL to 3D CT remains challenging due to the high memory cost of full-volume transformers and the anisotropic spatial structure of CT data, which is not well captured by conventional masking strategies. We propose NEMESIS, a masked autoencoder (MAE) framework that operates on local 128x128x128 superpatches, enabling memory-efficient training while preserving anatomical detail. NEMESIS introduces three key components: (i) noise-enhanced reconstruction as a pretext task, (ii) Masked Anatomical Transformer Blocks (MATB) that perform dual-masking through parallel plane-wise and axis-wise token removal, and (iii) NEMESIS Tokens (NT) for cross-scale context aggregation. On the BTCV multi-organ classification benchmark, NEMESIS with a frozen backbone and a linear classifier achieves a mean AUROC of 0.9633, surpassing fully fine-tuned SuPreM (0.9493) and VoCo (0.9387). Under a low-label regime with only 10% of available annotations, it retains an AUROC of 0.9075, demonstrating strong label efficiency. Furthermore, the superpatch-based design reduces computational cost to 31.0 GFLOPs per forward pass, compared to 985.8 GFLOPs for the full-volume baseline, providing a scalable and robust foundation for 3D medical imaging.

摘要:體積CT影像對臨床診斷至關重要,但標註3D體積既昂貴又耗時,這促使了從未標記數據中進行自我監督學習(SSL)。然而,由於全體積Transformer的高記憶體成本以及CT數據的各向異性空間結構,將SSL應用於3D CT仍然具有挑戰性,傳統的遮罩策略無法很好地捕捉這一點。我們提出了NEMESIS,一個在局部128x128x128超補丁上運行的遮罩自編碼器(MAE)框架,實現了記憶體高效的訓練,同時保留了解剖細節。NEMESIS引入了三個關鍵組件:(i)作為前置任務的噪聲增強重建,(ii)通過平行平面和軸向標記移除進行雙重遮罩的遮罩解剖Transformer塊(MATB),以及(iii)用於跨尺度上下文聚合的NEMESIS標記(NT)。在BTCV多器官分類基準上,NEMESIS與冷凍主幹和線性分類器的組合達到了0.9633的平均AUROC,超越了完全微調的SuPreM(0.9493)和VoCo(0.9387)。在僅有10%可用標註的低標籤情況下,它仍然保持0.9075的AUROC,顯示出強大的標籤效率。此外,基於超補丁的設計將每次前向傳播的計算成本降低至31.0 GFLOPs,相較於全體積基線的985.8 GFLOPs,為3D醫學影像提供了一個可擴展且穩健的基礎。

Does Your Optimizer Care How You Normalize? Normalization-Optimizer Coupling in LLM Training

2604.01563v1 by Abdelrahman Abouzeid

In LLM training, normalization layers and optimizers are typically treated as independent design choices. In a 3x2 factorial at 1B parameters and 1000 training steps, we show this assumption can fail: Dynamic Erf (Derf; Chen & Liu, 2025) suffers a large negative interaction with Muon (Jordan, 2024), with its gap to RMSNorm growing from +0.31 nats under AdamW to +0.97 under Muon, approximately three times larger. Dynamic Tanh (DyT; Zhu et al., 2025), included as a bounded-normalizer control, shows no such penalty. Our evidence points to two failure modes of erf under Muon's faster spectral-norm growth: saturation (lossy compression) and scale blindness (discarding activation magnitude). An EMA-blend that reintroduces running scale estimates recovers ~84% of the gap. Separately, reducing Derf's alpha from its published default of 0.5 to 0.3 recovers ~80% by keeping erf in its near-linear regime, where it approximately preserves relative scale; this setting is not the published default of Chen & Liu (2025). Using Derf's published default alpha with Muon incurs a 0.66-nat interaction penalty without producing NaNs or divergence, making the failure easy to miss in short pilot runs.

摘要:在LLM訓練中,正規化層和優化器通常被視為獨立的設計選擇。在1B參數和1000訓練步驟的3x2因子實驗中,我們顯示這一假設可能會失效:動態誤差函數(Derf;Chen & Liu, 2025)與Muon(Jordan, 2024)之間存在較大的負交互作用,其與RMSNorm的差距從AdamW下的+0.31 nats增長至Muon下的+0.97,約大三倍。作為有界正規化控制的動態雙曲正切(DyT;Zhu et al., 2025)並未顯示出這樣的懲罰。我們的證據指向在Muon的更快光譜範數增長下,erf的兩種失效模式:飽和(有損壓縮)和尺度盲(丟棄激活幅度)。一種重新引入運行尺度估計的EMA混合方法恢復了約84%的差距。另外,將Derf的alpha從其已發表的默認值(0.5調整至0.3)恢復了約80%,因為這樣可以使erf保持在其接近線性的範疇內,並大致保持相對尺度;這一設置並不是Chen & Liu(2025)所發表的默認值。使用Derf已發表的默認alpha與Muon結合會產生0.66-nat的交互懲罰,而不會產生NaNs或發散,使得在短期試點運行中容易忽略這一失敗。
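
The two failure modes named above can be illustrated numerically. The sketch below uses a stripped-down squashing function `erf(alpha * x)`; this is an assumption, since the abstract does not give Derf's full parameterization (any learnable scale or shift is omitted here). It also reproduces the interaction arithmetic from the reported gaps.

```python
import math

def erf_squash(x, alpha):
    """Simplified stand-in for an erf-based normalizer (assumed form)."""
    return math.erf(alpha * x)

def scale_ratio(x, alpha):
    """Output ratio when the input doubles; a scale-preserving map gives 2.0."""
    return erf_squash(2.0 * x, alpha) / erf_squash(x, alpha)

# Near-linear regime: a smaller alpha preserves relative scale better.
print(scale_ratio(1.0, alpha=0.5))
print(scale_ratio(1.0, alpha=0.3))

# Saturation and scale blindness: large activations all map to about 1.0,
# so an input of 8 and an input of 16 become indistinguishable.
print(erf_squash(8.0, alpha=0.5), erf_squash(16.0, alpha=0.5))

# Interaction penalty: how much more Derf's gap to RMSNorm grows under Muon
# than under AdamW, using the figures reported in the abstract.
gap_adamw, gap_muon = 0.31, 0.97
interaction = gap_muon - gap_adamw
print(round(interaction, 2), round(gap_muon / gap_adamw, 2))
```

The difference of the two gaps gives the 0.66-nat interaction quoted in the abstract, and their ratio is a little over 3, matching "approximately three times larger".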

Countering Catastrophic Forgetting of Large Language Models for Better Instruction Following via Weight-Space Model Merging

2604.01538v1 by Mengxian Lyu, Cheng Peng, Ziyi Chen, Mengyuan Zhang, Jieting Li Lu, Yonghui Wu

Large language models have been adopted in the medical domain for clinical documentation to reduce clinician burden. However, studies have reported that LLMs often "forget" a significant amount of instruction-following ability when fine-tuned using a task-specific medical dataset, a critical challenge in adopting general-purpose LLMs for clinical applications. This study presents a model merging framework to efficiently adapt general-purpose LLMs to the medical domain by countering this forgetting issue. By merging a clinical foundation model (GatorTronLlama) with a general instruct model (Llama-3.1-8B-Instruct) via interpolation-based merge methods, we seek to derive a domain-adapted model with strong performance on clinical tasks while retaining instruction-following ability. Comprehensive evaluation across medical benchmarks and five clinical generation tasks (e.g., radiology and discharge summarization) shows that merged models can effectively mitigate catastrophic forgetting, preserve clinical domain expertise, and retain instruction-following ability. In addition, our model merging strategies demonstrate training efficiency, achieving performance on par with fully fine-tuned baselines under severely constrained supervision (e.g., 64-shot vs. 256-shot). Consequently, weight-space merging constitutes a highly scalable solution for adapting open-source LLMs to clinical applications, facilitating broader deployment in resource-constrained healthcare environments.

摘要:大型語言模型已在醫療領域被採用,用於臨床文檔以減輕臨床醫師的負擔。然而,研究報告指出,當使用特定任務的醫療數據集進行微調時,LLMs經常會「遺忘」大量的指令跟隨能力,這是將通用LLMs應用於臨床的關鍵挑戰。本研究提出了一種模型合併框架,以有效地將通用LLMs適應於醫療領域,通過對抗這一遺忘問題。通過將臨床基礎模型(GatorTronLlama)與通用指令模型(Llama-3.1-8B-Instruct)通過基於插值的合併方法進行合併,我們旨在推導出一個在臨床任務上表現強勁的領域適應模型,同時保留指令跟隨能力。在醫療基準和五個臨床生成任務(例如,放射學和出院摘要)的全面評估顯示,合併模型可以有效減輕災難性遺忘,保留臨床領域專業知識,並保持指令跟隨能力。此外,我們的模型合併策略展示了訓練效率,在嚴格限制的監督下(例如,64-shot對比256-shot)達到與完全微調基準相當的性能。因此,權重空間合併構成了一種高度可擴展的解決方案,用於將開源LLMs適應於臨床應用,促進在資源有限的醫療環境中的更廣泛部署。
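
The simplest member of the interpolation-based merge family mentioned above is plain linear interpolation of matching parameters. The sketch below shows that idea on toy weight dictionaries; the parameter names, values, and mixing coefficient are hypothetical, and the abstract does not say which interpolation variants the authors actually evaluated.

```python
def linear_merge(weights_a, weights_b, t):
    """Interpolate two models with identical parameter names and shapes:
    merged = (1 - t) * a + t * b, with 0 <= t <= 1."""
    if weights_a.keys() != weights_b.keys():
        raise ValueError("models must share the same parameter names")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    merged = {}
    for name, values_a in weights_a.items():
        values_b = weights_b[name]
        if len(values_a) != len(values_b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - t) * a + t * b for a, b in zip(values_a, values_b)]
    return merged

# Toy stand-ins for a clinical model and a general instruct model
clinical = {"layer.weight": [0.0, 2.0], "layer.bias": [1.0]}
instruct = {"layer.weight": [4.0, 2.0], "layer.bias": [3.0]}

print(linear_merge(clinical, instruct, t=0.25))
```

Merging of this kind needs no gradient updates, which is the source of the training-efficiency argument: the only tunable quantity in this sketch is the mixing coefficient `t`.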

PHMForge: A Scenario-Driven Agentic Benchmark for Industrial Asset Lifecycle Maintenance

2604.01532v1 by Ayan Das, Dhaval Patel

Large language model (LLM) agents are increasingly deployed for complex tool-orchestration tasks, yet existing benchmarks fail to capture the rigorous demands of industrial domains where incorrect decisions carry significant safety and financial consequences. To address this critical gap, we introduce PHMForge, the first comprehensive benchmark specifically designed to evaluate LLM agents on Prognostics and Health Management (PHM) tasks through realistic interactions with domain-specific MCP servers. Our benchmark encompasses 75 expert-curated scenarios spanning 7 industrial asset classes (turbofan engines, bearings, electric motors, gearboxes, aero-engines) across 5 core task categories: Remaining Useful Life (RUL) Prediction, Fault Classification, Engine Health Analysis, Cost-Benefit Analysis, and Safety/Policy Evaluation. To enable rigorous evaluation, we construct 65 specialized tools across two MCP servers and implement execution-based evaluators with task-commensurate metrics: MAE/RMSE for regression, F1-score for classification, and categorical matching for health assessments. Through extensive evaluation of leading frameworks (ReAct, Cursor Agent, Claude Code) paired with frontier LLMs (Claude Sonnet 4.0, GPT-4o, Granite-3.0-8B), we find that even top-performing configurations achieve only 68% task completion, with systematic failures in tool orchestration (23% incorrect sequencing), multi-asset reasoning (14.9 percentage point degradation), and cross-equipment generalization (42.7% on held-out datasets). We open-source our complete benchmark, including scenario specifications, ground truth templates, tool implementations, and evaluation scripts, to catalyze research in agentic industrial AI.

摘要:大型語言模型(LLM)代理人越來越多地被部署於複雜的工具協調任務中,然而現有的基準無法捕捉到工業領域的嚴格需求,在這些領域中,不正確的決策會帶來重大的安全和財務後果。為了解決這一關鍵缺口,我們推出了PHMForge,這是第一個專門設計用於評估LLM代理人在預測與健康管理(PHM)任務上的綜合基準,通過與特定領域的MCP伺服器進行現實互動。我們的基準涵蓋了75個專家策劃的場景,跨越7個工業資產類別(渦扇發動機、軸承、電動馬達、齒輪箱、航空發動機),涵蓋5個核心任務類別:剩餘使用壽命(RUL)預測、故障分類、發動機健康分析、成本效益分析和安全/政策評估。為了實現嚴格的評估,我們在兩個MCP伺服器上構建了65個專門工具,並實施了基於執行的評估者,使用與任務相稱的指標:回歸的MAE/RMSE、分類的F1分數以及健康評估的類別匹配。通過對領先框架(ReAct、Cursor Agent、Claude Code)與前沿LLM(Claude Sonnet 4.0、GPT-4o、Granite-3.0-8B)的廣泛評估,我們發現即使是表現最佳的配置也僅達到68%的任務完成率,在工具協調(23%的不正確排序)、多資產推理(下降14.9個百分點)和跨設備泛化(在保留數據集上為42.7%)方面存在系統性失敗。我們開源了完整的基準,包括場景規範、真實模板、工具實現和評估腳本,以促進代理工業AI的研究。
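
The task-commensurate metrics named in the PHMForge abstract (MAE/RMSE for regression, F1-score for classification) are standard and can be written in a few lines. The sketch below is not taken from the benchmark's evaluation scripts; the function names and the toy RUL and fault labels are illustrative assumptions, and F1 is computed for a single positive class.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error, e.g. for Remaining Useful Life predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large misses more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def f1_score(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical agent outputs: RUL in cycles and fault labels.
rul_true, rul_pred = [100.0, 80.0], [90.0, 100.0]
fault_true = ["fault", "ok", "fault", "ok"]
fault_pred = ["fault", "fault", "ok", "ok"]

rul_mae = mae(rul_true, rul_pred)
rul_rmse = rmse(rul_true, rul_pred)
fault_f1 = f1_score(fault_true, fault_pred, positive="fault")
```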

A Role-Based LLM Framework for Structured Information Extraction from Healthy Food Policies

2604.01529v1 by Congjing Zhang, Ruoxuan Bao, Jingyu Li, Yoav Ackerman, Shuai Huang, Yanfang Su

Current Large Language Model (LLM) approaches for information extraction (IE) in the healthy food policy domain are often hindered by various factors, including misinformation, specifically hallucinations, misclassifications, and omissions that result from the structural diversity and inconsistency of policy documents. To address these limitations, this study proposes a role-based LLM framework that automates the IE from unstructured policy data by assigning specialized roles: an LLM policy analyst for metadata and mechanism classification, an LLM legal strategy specialist for identifying complex legal approaches, and an LLM food system expert for categorizing food system stages. This framework mimics expert analysis workflows by incorporating structured domain knowledge, including explicit definitions of legal mechanisms and classification criteria, into role-specific prompts. We evaluate the framework using 608 healthy food policies from the Healthy Food Policy Project (HFPP) database, comparing its performance against zero-shot, few-shot, and chain-of-thought (CoT) baselines using Llama-3.3-70B. Our proposed framework demonstrates superior performance in complex reasoning tasks, offering a reliable and transparent methodology for automating IE from health policies.

摘要:目前在健康食品政策領域中,針對信息提取(IE)的大型語言模型(LLM)方法常常受到各種因素的阻礙,包括錯誤信息,特別是幻覺、錯誤分類以及由於政策文件的結構多樣性和不一致性而導致的遺漏。為了解決這些限制,本研究提出了一個基於角色的LLM框架,通過分配專門角色來自動化從非結構化政策數據中提取信息:一個LLM政策分析師負責元數據和機制分類,一個LLM法律策略專家負責識別複雜的法律方法,以及一個LLM食品系統專家負責對食品系統階段進行分類。該框架通過將結構化的領域知識納入角色特定的提示,模仿專家分析工作流程,包括法律機制和分類標準的明確定義。我們使用來自健康食品政策項目(HFPP)數據庫的608個健康食品政策來評估該框架,並將其性能與使用Llama-3.3-70B的零樣本、少樣本和思維鏈(CoT)基準進行比較。我們提出的框架在複雜推理任務中顯示出卓越的性能,提供了一種可靠且透明的方法來自動化從健康政策中提取信息。

DISCO-TAB: A Hierarchical Reinforcement Learning Framework for Privacy-Preserving Synthesis of Complex Clinical Data

2604.01481v1 by Arshia Ilaty, Hossein Shirazi, Amir Rahmani, Hajar Homayouni

The development of robust clinical decision support systems is frequently impeded by the scarcity of high-fidelity, privacy-preserving biomedical data. While Generative Large Language Models (LLMs) offer a promising avenue for synthetic data generation, they often struggle to capture the complex, non-linear dependencies and severe class imbalances inherent in Electronic Health Records (EHR), leading to statistically plausible but clinically invalid records. To bridge this gap, we introduce DISCO-TAB (DIScriminator-guided COntrol for TABular synthesis), a novel framework that orchestrates a fine-tuned LLM with a multi-objective discriminator system optimized via Reinforcement Learning. Unlike prior methods relying on scalar feedback, DISCO-TAB evaluates synthesis at four granularities (token, sentence, feature, and row) while integrating Automated Constraint Discovery and Inverse-Frequency Reward Shaping to autonomously preserve latent medical logic and resolve minority-class collapse. We rigorously validate our framework across diverse benchmarks, including high-dimensional, small-sample medical datasets (e.g., Heart Failure, Parkinson's). Our results demonstrate that hierarchical feedback yields state-of-the-art performance, achieving up to 38.2% improvement in downstream clinical classifier utility compared to GAN and Diffusion baselines, while ensuring exceptional statistical fidelity (JSD < 0.01) and robust resistance to membership inference attacks. This work establishes a new standard for generating trustworthy, utility-preserving synthetic tabular data for sensitive healthcare applications.

摘要:臨床決策支持系統的發展常常受到高保真、隱私保護的生物醫學數據稀缺的阻礙。雖然生成大型語言模型(LLMs)為合成數據生成提供了一個有前景的途徑,但它們往往難以捕捉電子健康記錄(EHR)中固有的複雜非線性依賴關係和嚴重的類別不平衡,導致統計上合理但臨床上無效的記錄。為了彌補這一差距,我們提出了DISCO-TAB(基於判別器的表格合成控制),這是一個新穎的框架,協調了一個經過微調的LLM與通過強化學習優化的多目標判別器系統。與依賴標量反饋的先前方法不同,DISCO-TAB在四個粒度上評估合成:標記、句子、特徵和行,同時整合自動約束發現和逆頻率獎勵塑造,以自主保留潛在醫療邏輯並解決少數類別崩潰。我們在多個基準上嚴格驗證了我們的框架,包括高維度、小樣本醫療數據集(例如,心力衰竭、帕金森病)。我們的結果表明,分層反饋產生了最先進的性能,與GAN和擴散基準相比,在下游臨床分類器效用上提高了多達38.2%,同時確保了卓越的統計保真度(JSD < 0.01)以及對成員推斷攻擊的強大抵抗力。這項工作為生成值得信賴、保留效用的合成表格數據設立了新的標準,適用於敏感的醫療保健應用。
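
Two quantities in the DISCO-TAB abstract lend themselves to a small worked example: the Jensen-Shannon divergence (JSD) used as the fidelity measure, and an inverse-frequency weighting of the kind that reward shaping for minority classes could build on. The abstract does not give the exact reward formula, so `inverse_frequency_weights` below uses the common `n / (k * count)` balancing rule purely as an assumption; the JSD function follows the standard base-2 definition.

```python
import math
from collections import Counter


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so bounded by 1) of two discrete distributions."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0.0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def inverse_frequency_weights(labels):
    """Assumed weighting: rarer classes receive proportionally larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}


# Hypothetical marginal of one categorical feature in real vs. synthetic data.
real_dist = [0.7, 0.2, 0.1]
synthetic_dist = [0.68, 0.21, 0.11]
fidelity = js_divergence(real_dist, synthetic_dist)

# Hypothetical imbalanced outcome column: 90 negatives, 10 positives.
weights = inverse_frequency_weights(["neg"] * 90 + ["pos"] * 10)
```

With these toy labels the minority class receives a weight nine times larger than the majority class, which is the imbalance ratio itself.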

Low-Burden LLM-Based Preference Learning: Personalizing Assistive Robots from Natural Language Feedback for Users with Paralysis

2604.01463v1 by Keshav Shankar, Dan Ding, Wei Gao

Physically Assistive Robots (PARs) require personalized behaviors to ensure user safety and comfort. However, traditional preference learning methods, like exhaustive pairwise comparisons, cause severe physical and cognitive fatigue for users with profound motor impairments. To solve this, we propose a low-burden, offline framework that translates unstructured natural language feedback directly into deterministic robotic control policies. To safely bridge the gap between ambiguous human speech and robotic code, our pipeline uses Large Language Models (LLMs) grounded in the Occupational Therapy Practice Framework (OTPF). This clinical reasoning decodes subjective user reactions into explicit physical and psychological needs, which are then mapped into transparent decision trees. Before deployment, an automated "LLM-as-a-Judge" verifies the code's structural safety. We validated this system in a simulated meal preparation study with 10 adults with paralysis. Results show our natural language approach significantly reduces user workload compared to traditional baselines. Additionally, independent clinical experts confirmed the generated policies are safe and accurately reflect user preferences.

摘要:身體輔助機器人(PARs)需要個性化的行為以確保使用者的安全和舒適。然而,傳統的偏好學習方法,如全面的成對比較,會對有嚴重運動障礙的使用者造成嚴重的身體和認知疲勞。為了解決這個問題,我們提出了一個低負擔的離線框架,將非結構化的自然語言反饋直接轉換為確定性的機器人控制政策。為了安全地彌合模糊的人類語言與機器人代碼之間的差距,我們的流程使用基於職業治療實踐框架(OTPF)的大型語言模型(LLMs)。這種臨床推理將主觀的使用者反應解碼為明確的身體和心理需求,然後將其映射到透明的決策樹中。在部署之前,自動化的“LLM作為評判者”驗證代碼的結構安全性。我們在一項模擬的餐飲準備研究中驗證了這個系統,參與者為10名癱瘓成人。結果顯示,我們的自然語言方法顯著減少了使用者的工作負擔,與傳統基準相比。此外,獨立的臨床專家確認生成的政策是安全的,並準確反映了使用者的偏好。

When AI Gets it Wrong: Reliability and Risk in AI-Assisted Medication Decision Systems

2604.01449v1 by Khalid Adnan Alsayed

Artificial intelligence (AI) systems are increasingly integrated into healthcare and pharmacy workflows, supporting tasks such as medication recommendations, dosage determination, and drug interaction detection. While these systems often demonstrate strong performance under standard evaluation metrics, their reliability in real-world decision-making remains insufficiently understood. In high-risk domains such as medication management, even a single incorrect recommendation can result in severe patient harm. This paper examines the reliability of AI-assisted medication systems by focusing on system failures and their potential clinical consequences. Rather than evaluating performance solely through aggregate metrics, this work shifts attention towards how errors occur and what happens when AI systems produce incorrect outputs. Through a series of controlled, simulated scenarios involving drug interactions and dosage decisions, we analyse different types of system failures, including missed interactions, incorrect risk flagging, and inappropriate dosage recommendations. The findings highlight that AI errors in medication-related contexts can lead to adverse drug reactions, ineffective treatment, or delayed care, particularly when systems are used without sufficient human oversight. Furthermore, the paper discusses the risks of over-reliance on AI recommendations and the challenges posed by limited transparency in decision-making processes. This work contributes a reliability-focused perspective on AI evaluation in healthcare, emphasising the importance of understanding failure behavior and real-world impact. It highlights the need to complement traditional performance metrics with risk-aware evaluation approaches, particularly in safety-critical domains such as pharmacy practice.

摘要:人工智慧(AI)系統越來越多地融入醫療保健和藥房工作流程中,支持藥物建議、劑量確定和藥物相互作用檢測等任務。雖然這些系統在標準評估指標下通常表現良好,但它們在現實世界決策中的可靠性仍然不足以理解。在高風險領域,如藥物管理,即使是一個錯誤的建議也可能導致患者嚴重受傷。本文通過關注系統故障及其潛在臨床後果,檢視AI輔助藥物系統的可靠性。這項工作不僅僅通過總體指標來評估性能,而是轉向關注錯誤是如何發生的以及當AI系統產生錯誤輸出時會發生什麼。通過一系列涉及藥物相互作用和劑量決策的受控模擬場景,我們分析了不同類型的系統故障,包括漏掉的相互作用、不正確的風險標記和不當的劑量建議。研究結果突顯出,AI在與藥物相關的情境中的錯誤可能導致不良藥物反應、無效治療或延遲護理,特別是在系統在缺乏足夠人類監督的情況下使用時。此外,本文還討論了對AI建議的過度依賴風險以及決策過程中透明度有限所帶來的挑戰。這項工作為醫療保健中AI評估提供了一個以可靠性為重點的視角,強調理解故障行為和現實影響的重要性。它突顯了在安全關鍵領域如藥房實踐中,將傳統性能指標與風險意識評估方法相結合的必要性。

AffordTissue: Dense Affordance Prediction for Tool-Action Specific Tissue Interaction

2604.01371v1 by Aiza Maksutova, Lalithkumar Seenivasan, Hao Ding, Jiru Xu, Chenhao Yu, Chenyan Jing, Yiqing Shen, Mathias Unberath

Surgical action automation has progressed rapidly toward achieving surgeon-like dexterous control, driven primarily by advances in learning from demonstration and vision-language-action models. While these have demonstrated success in table-top experiments, translating them to clinical deployment remains challenging: current methods offer limited predictability on where instruments will interact on tissue surfaces and lack explicit conditioning inputs to enforce tool-action-specific safe interaction regions. Addressing this gap, we introduce AffordTissue, a multimodal framework for predicting tool-action specific tissue affordance regions as dense heatmaps during cholecystectomy. Our approach combines a temporal vision encoder capturing tool motion and tissue dynamics across multiple viewpoints, language conditioning enabling generalization across diverse instrument-action pairs, and a DiT-style decoder for dense affordance prediction. We establish the first tissue affordance benchmark by curating and annotating 15,638 video clips across 103 cholecystectomy procedures, covering six unique tool-action pairs involving four instruments (hook, grasper, scissors, clipper) and their associated tasks: dissection, grasping, clipping, and cutting. Experiments demonstrate substantial improvement over vision-language model baselines (20.6 px ASSD vs. 60.2 px for Molmo-VLM), showing that our task-specific architecture outperforms large-scale foundation models for dense surgical affordance prediction. By predicting tool-action specific tissue affordance regions, AffordTissue provides explicit spatial reasoning for safe surgical automation, potentially unlocking explicit policy guidance toward appropriate tissue regions and early safe stop when instruments deviate outside predicted safe zones.

摘要:外科手術行動自動化已迅速進展,朝著實現類似外科醫生的靈巧控制邁進,主要受到從示範學習和視覺-語言-行動模型的進步驅動。雖然這些在桌面實驗中已顯示出成功,但將其轉化為臨床應用仍然具有挑戰性:當前的方法對於儀器在組織表面上的互動位置提供的預測能力有限,並且缺乏明確的條件輸入來強制執行工具-行動特定的安全互動區域。為了解決這一差距,我們介紹了AffordTissue,一個多模態框架,用於在膽囊切除術中預測工具-行動特定的組織可用性區域,並以密集熱圖的形式呈現。我們的方法結合了一個時間視覺編碼器,捕捉多個視角下的工具運動和組織動態,語言條件化使得在多樣的儀器-行動對中進行泛化,以及一個DiT風格的解碼器,用於密集的可用性預測。我們通過策劃和標註103個膽囊切除術中的15,638個視頻片段,建立了第一個組織可用性基準,涵蓋六個獨特的工具-行動對,涉及四種儀器(鉤子、抓鉗、剪刀、夾子)及其相關任務:解剖、抓取、夾持和切割。實驗顯示,相較於視覺-語言模型基準,我們的架構在密集外科可用性預測上有顯著改善(20.6 px ASSD 對比 60.2 px 的 Molmo-VLM),顯示我們的任務特定架構優於大型基礎模型。通過預測工具-行動特定的組織可用性區域,AffordTissue 為安全的外科自動化提供了明確的空間推理,潛在地解鎖了對適當組織區域的明確政策指導,並在儀器偏離預測的安全區域時及早安全停止。

Safety, Security, and Cognitive Risks in World Models

2604.01346v1 by Manoj Parmar

World models -- learned internal simulators of environment dynamics -- are rapidly becoming foundational to autonomous decision-making in robotics, autonomous vehicles, and agentic AI. Yet this predictive power introduces a distinctive set of safety, security, and cognitive risks. Adversaries can corrupt training data, poison latent representations, and exploit compounding rollout errors to cause catastrophic failures in safety-critical deployments. World model-equipped agents are more capable of goal misgeneralisation, deceptive alignment, and reward hacking precisely because they can simulate the consequences of their own actions. Authoritative world model predictions further foster automation bias and miscalibrated human trust that operators lack the tools to audit. This paper surveys the world model landscape; introduces formal definitions of trajectory persistence and representational risk; presents a five-profile attacker capability taxonomy; and develops a unified threat model extending MITRE ATLAS and the OWASP LLM Top 10 to the world model stack. We provide an empirical proof-of-concept on trajectory-persistent adversarial attacks (GRU-RSSM: A_1 = 2.26x amplification, -59.5% reduction under adversarial fine-tuning; stochastic RSSM proxy: A_1 = 0.65x; DreamerV3 checkpoint: non-zero action drift confirmed). We illustrate risks through four deployment scenarios and propose interdisciplinary mitigations spanning adversarial hardening, alignment engineering, NIST AI RMF and EU AI Act governance, and human-factors design. We argue that world models must be treated as safety-critical infrastructure requiring the same rigour as flight-control software or medical devices.

摘要:世界模型——學習的環境動態內部模擬器——正迅速成為機器人、自主車輛和自主人工智慧中自主決策的基礎。然而,這種預測能力引入了一組獨特的安全性、保安性和認知風險。對手可以腐敗訓練數據、毒害潛在表示,並利用累積的展開錯誤來造成在安全關鍵部署中的災難性失敗。配備世界模型的代理更容易出現目標誤泛化、欺騙性對齊和獎勵駭客,正因為它們能夠模擬自身行動的後果。權威的世界模型預測進一步助長了自動化偏見和不當校準的人類信任,操作員缺乏審計工具。
本文調查了世界模型的現狀;引入了軌跡持續性和表示風險的正式定義;提出了五種攻擊者能力分類法;並發展了一個統一的威脅模型,將MITRE ATLAS和OWASP LLM前10名擴展到世界模型堆棧。我們提供了一個關於軌跡持續性對抗攻擊的實證概念驗證(GRU-RSSM: A_1 = 2.26倍增強,對抗微調下減少59.5%;隨機RSSM代理: A_1 = 0.65倍;DreamerV3檢查點: 確認非零行動漂移)。我們通過四個部署場景說明了風險,並提出了跨學科的緩解措施,涵蓋對抗加固、對齊工程、NIST AI RMF和EU AI法案治理,以及人因設計。我們認為,世界模型必須被視為安全關鍵基礎設施,需遵循與飛行控制軟件或醫療設備相同的嚴謹性。

Regularizing Attention Scores with Bootstrapping

2604.01339v1 by Neo Christopher Chung, Maxim Laletin

Vision transformers (ViTs) rely on an attention mechanism to weigh input features, and attention scores have therefore naturally been considered as explanations for their decision-making process. However, attention scores are almost always non-zero, resulting in noisy and diffuse attention maps and limiting interpretability. Can we quantify the uncertainty of attention scores and obtain regularized attention scores? To this end, we consider the attention scores of a ViT in a statistical framework where independent noise would lead to insignificant yet non-zero scores. Leveraging statistical learning techniques, we introduce bootstrapping for attention scores, which generates a baseline distribution of attention scores by resampling input features. This bootstrap distribution is then used to estimate the significance and posterior probabilities of attention scores. In natural and medical images, the proposed Attention Regularization approach demonstrates a straightforward removal of spurious attention arising from noise, drastically improving shrinkage and sparsity. Quantitative evaluations are conducted using both simulation and real-world datasets. Our study highlights bootstrapping as a practical regularization tool when using attention scores as explanations for ViT. Code available: https://github.com/ncchung/AttentionRegularization

摘要:視覺轉換器(ViT)依賴注意力機制來加權輸入特徵,因此注意力分數自然被視為其決策過程的解釋。然而,注意力分數幾乎總是非零的,這導致了噪聲和擴散的注意力圖,限制了可解釋性。我們能否量化注意力分數的不確定性度量並獲得正則化的注意力分數?為此,我們在統計框架中考慮ViT的注意力分數,其中獨立噪聲會導致不重要但非零的分數。利用統計學習技術,我們引入了注意力分數的自助法,通過重新抽樣輸入特徵生成注意力分數的基準分佈。這樣的自助分佈隨後用於估計注意力分數的顯著性和後驗概率。在自然和醫學圖像中,所提出的\emph{注意力正則化}方法展示了直接去除由噪聲引起的虛假注意力,顯著改善了收縮性和稀疏性。定量評估使用模擬和現實世界數據集進行。我們的研究強調自助法作為使用注意力分數作為ViT解釋的實用正則化工具。
代碼可用:https://github.com/ncchung/AttentionRegularization
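
The bootstrapping idea in this abstract (resample input features to build a baseline distribution of attention scores, then keep only scores that stand out from it) can be illustrated on a toy single-query attention layer. This is a hedged sketch and not the code from the linked repository: the resampling scheme, the empirical p-value rule, the threshold `alpha`, and every function name here are assumptions.

```python
import math
import random


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_scores(query, patches):
    """Toy attention: softmax over query-patch dot products."""
    return softmax([sum(q * x for q, x in zip(query, p)) for p in patches])


def bootstrap_regularize(query, patches, n_boot=200, alpha=0.05, seed=0):
    """Zero out attention scores that are not significant against a bootstrap baseline."""
    rng = random.Random(seed)
    observed = attention_scores(query, patches)
    pool = [x for p in patches for x in p]  # all feature values of the input
    dim = len(query)
    null_scores = []
    for _ in range(n_boot):
        # Resample feature values with replacement, breaking any real structure.
        fake = [[rng.choice(pool) for _ in range(dim)] for _ in patches]
        null_scores.extend(attention_scores(query, fake))
    # Empirical p-value: share of baseline scores at least as large as the observed one.
    pvals = [(1 + sum(s >= o for s in null_scores)) / (1 + len(null_scores)) for o in observed]
    regularized = [o if p <= alpha else 0.0 for o, p in zip(observed, pvals)]
    return regularized, pvals


# Hypothetical input: one patch aligned with the query, seven noise patches.
query = [1.0, 1.0]
patches = [[5.0, 5.0]] + [[0.1, -0.1] for _ in range(7)]
regularized, pvals = bootstrap_regularize(query, patches)
```

In this toy case the aligned patch keeps its score while the small non-zero scores of the noise patches are set to exactly zero, which is the sparsity effect the abstract describes.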

AdaLoRA-QAT: Adaptive Low-Rank and Quantization-Aware Segmentation

2604.01167v1 by Prantik Deb, Srimanth Dhondy, N. Ramakrishna, Anu Kapoor, Raju S. Bapi, Tapabrata Chakraborti

Chest X-ray (CXR) segmentation is an important step in computer-aided diagnosis, yet deploying large foundation models in clinical settings remains challenging due to computational constraints. We propose AdaLoRA-QAT, a two-stage fine-tuning framework that combines adaptive low-rank encoder adaptation with full quantization-aware training. Adaptive rank allocation improves parameter efficiency, while selective mixed-precision INT8 quantization preserves structural fidelity crucial for clinical reliability. Evaluated across large-scale CXR datasets, AdaLoRA-QAT achieves 95.6% Dice, matching full-precision SAM decoder fine-tuning while reducing trainable parameters by 16.6× and yielding 2.24× model compression. A Wilcoxon signed-rank test confirms that quantization does not significantly degrade segmentation accuracy. These results demonstrate that AdaLoRA-QAT effectively balances accuracy, efficiency, and structural trustworthiness, enabling compact and deployable foundation models for medical image segmentation. Code and pretrained models are available at: https://prantik-pdeb.github.io/adaloraqat.github.io/

摘要:胸部X光(CXR)分割是電腦輔助診斷中的一個重要步驟,但由於計算限制,在臨床環境中部署大型基礎模型仍然具有挑戰性。我們提出AdaLoRA-QAT,一個兩階段的微調框架,結合了自適應低秩編碼器適應和全面量化感知訓練。自適應秩分配提高了參數效率,而選擇性混合精度INT8量化則保留了對臨床可靠性至關重要的結構保真度。在大規模CXR數據集上評估,AdaLoRA-QAT達到了95.6%的Dice,與全精度SAM解碼器微調相匹配,同時將可訓練參數減少了16.6\times,並實現了2.24\times的模型壓縮。威爾科克森符號秩檢驗確認量化並未顯著降低分割準確性。這些結果表明,AdaLoRA-QAT有效地平衡了準確性、效率和結構可信度,使得醫學影像分割的緊湊和可部署的基礎模型成為可能。代碼和預訓練模型可在以下網址獲得:https://prantik-pdeb.github.io/adaloraqat.github.io/
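
Two building blocks mentioned in the AdaLoRA-QAT abstract are easy to show in isolation: the Dice score used for evaluation, and the "fake quantization" step that quantization-aware training typically inserts in the forward pass. The sketch below assumes symmetric per-tensor INT8 quantization, which the abstract does not confirm (it only mentions selective mixed-precision INT8), and all names are illustrative.

```python
def dice(pred, target):
    """Dice coefficient for flat binary masks (1 = foreground)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total


def fake_quantize_int8(weights):
    """Quantize to INT8 and immediately dequantize (assumed symmetric, per-tensor)."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights), 1.0
    scale = max_abs / 127.0
    # Round to the nearest integer level and clamp to the signed 8-bit range used here.
    levels = [max(-127, min(127, round(w / scale))) for w in weights]
    return [level * scale for level in levels], scale


# Hypothetical 4-pixel masks and a tiny weight vector.
score = dice([1, 1, 0, 0], [1, 0, 0, 0])
weights = [0.0, 0.031, -0.5, 1.27, -1.0]
dequantized, scale = fake_quantize_int8(weights)
```

The dequantized weights differ from the originals by at most half a quantization step (`scale / 2`), which is the error the training stage would learn to tolerate.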

Brainstacks: Cross-Domain Cognitive Capabilities via Frozen MoE-LoRA Stacks for Continual LLM Learning

2604.01152v1 by Mohammad R. Abu Ayyash

We present Brainstacks, a modular architecture for continual multi-domain fine-tuning of large language models that packages domain expertise as frozen adapter stacks composing additively on a shared frozen base at inference. Five interlocking components: (1) MoE-LoRA with Shazeer-style noisy top-2 routing across all seven transformer projections under QLoRA 4-bit quantization with rsLoRA scaling; (2) an inner loop performing residual boosting by freezing trained stacks and adding new ones; (3) an outer loop training sequential domain-specific stacks with curriculum-ordered dependencies; (4) null-space projection via randomized SVD constraining new stacks to subspaces orthogonal to prior directions, achieving zero forgetting in isolation; (5) an outcome-based sigmoid meta-router trained on empirically discovered domain-combination targets that selectively weights stacks, enabling cross-domain composition. Two boundary experiments: (6) PSN pretraining on a randomly initialized model; (7) per-domain RL (DPO/GRPO) validating compatibility with post-SFT alignment. Validated on TinyLlama-1.1B (4 domains, 9 stacks) and Gemma 3 12B IT (5 domains, 10 stacks), MoE-LoRA achieves 2.5x faster convergence than parameter-matched single LoRA, residual boosting breaks through the single-stack ceiling, and the routed system recovers generation quality destroyed by ungated stack accumulation. The central finding: the outcome-based router discovers that domain stacks encode transferable cognitive primitives (instruction-following clarity, numerical reasoning, procedural logic, chain-of-thought structure) rather than domain-specific knowledge, with medical prompts routing to chat+math stacks in 97% of cases despite zero medical data in those stacks.

摘要:我們提出了 Brainstacks,一種模組化架構,用於大型語言模型的持續多領域微調,將領域專業知識打包為凍結的適配器堆疊,這些堆疊在推理時在共享的凍結基礎上進行加法組合。五個相互交織的組件:(1) MoE-LoRA,使用 Shazeer 風格的噪聲 top-2 路由,跨越所有七個Transformer投影,在 QLoRA 4 位量化下,並使用 rsLoRA 縮放;(2) 內部循環通過凍結訓練堆疊並添加新的堆疊來執行殘差增強;(3) 外部循環訓練具有課程排序依賴關係的序列領域特定堆疊;(4) 通過隨機 SVD 的零空間投影,將新的堆疊約束到與先前方向正交的子空間,實現零遺忘;(5) 基於結果的 sigmoid 元路由器,根據經驗發現的領域組合目標進行訓練,選擇性地加權堆疊,使跨領域組合成為可能。兩個邊界實驗:(6) 在隨機初始化的模型上進行 PSN 預訓練;(7) 每個領域的強化學習(DPO/GRPO)驗證與後 SFT 對齊的兼容性。在 TinyLlama-1.1B(4 個領域,9 個堆疊)和 Gemma 3 12B IT(5 個領域,10 個堆疊)上進行驗證,MoE-LoRA 的收斂速度比參數匹配的單一 LoRA 快 2.5 倍,殘差增強突破了單堆疊的天花板,路由系統恢復了因無閘堆疊累積而損失的生成質量。核心發現:基於結果的路由器發現領域堆疊編碼了可轉移的認知原語(遵循指令的清晰度、數字推理、程序邏輯、思維鏈結構),而不是特定於領域的知識,儘管這些堆疊中沒有醫療數據,但醫療提示在 97% 的情況下路由到 chat+math 堆疊。
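
Component (4) of the Brainstacks abstract constrains new stacks to subspaces orthogonal to prior directions. The core operation is an orthogonal projection, sketched below. The abstract states that the prior directions come from a randomized SVD; for a self-contained example this sketch swaps in plain Gram-Schmidt on a few explicit vectors, and the function names are assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis of the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = dot(w, q)
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors that are linearly dependent on earlier ones
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(update, prior_directions):
    """Remove every component of `update` that lies in the span of prior directions."""
    projected = list(update)
    for q in orthonormal_basis(prior_directions):
        coeff = dot(projected, q)
        projected = [pi - coeff * qi for pi, qi in zip(projected, q)]
    return projected


# Hypothetical 3-D example: earlier domains occupy the x-y plane.
prior = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
new_update = [2.0, 3.0, 4.0]
safe_update = project_to_null_space(new_update, prior)
```

After projection the update has zero inner product with every prior direction, so to first order it cannot change what the earlier stacks encoded along those directions.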

PsychAgent: An Experience-Driven Lifelong Learning Agent for Self-Evolving Psychological Counselor

2604.00931v2 by Yutao Yang, Junsong Li, Qianjun Pan, Jie Zhou, Kai Chen, Qin Chen, Jingyuan Zhao, Ningning Zhou, Xin Li, Liang He

Existing methods for AI psychological counselors predominantly rely on supervised fine-tuning using static dialogue datasets. However, this contrasts with human experts, who continuously refine their proficiency through clinical practice and accumulated experience. To bridge this gap, we propose an Experience-Driven Lifelong Learning Agent (PsychAgent) for psychological counseling. First, we establish a Memory-Augmented Planning Engine tailored for longitudinal multi-session interactions, which ensures therapeutic continuity through persistent memory and strategic planning. Second, to support self-evolution, we design a Skill Evolution Engine that extracts new practice-grounded skills from historical counseling trajectories. Finally, we introduce a Reinforced Internalization Engine that integrates the evolved skills into the model via rejection fine-tuning, aiming to improve performance across diverse scenarios. Comparative analysis shows that our approach achieves higher scores than strong general LLMs (e.g., GPT-5.4, Gemini-3) and domain-specific baselines across all reported evaluation dimensions. These results suggest that lifelong learning can improve the consistency and overall quality of multi-session counseling responses.

摘要:現有的AI心理諮詢師方法主要依賴於使用靜態對話數據集的監督微調。然而,這與人類專家形成對比,人類專家通過臨床實踐和積累的經驗不斷提升自己的專業能力。為了彌補這一差距,我們提出了一個以經驗驅動的終身學習代理(\texttt{PsychAgent})用於心理諮詢。首先,我們建立了一個針對長期多次會話互動的記憶增強規劃引擎,這確保了通過持久記憶和戰略規劃實現治療的連續性。其次,為了支持自我演化,我們設計了一個技能演化引擎,從歷史諮詢軌跡中提取基於實踐的新技能。最後,我們引入了一個強化內化引擎,通過拒絕微調將演化的技能整合到模型中,旨在提高在各種情境下的表現。比較分析顯示,我們的方法在所有報告的評估維度上都達到了比強大的通用LLM(例如,GPT-5.4、Gemini-3)和特定領域基準更高的分數。這些結果表明,終身學習可以提高多次會話諮詢回應的一致性和整體質量。

OkanNet: A Lightweight Deep Learning Architecture for Classification of Brain Tumor from MRI Images

2604.01264v1 by Okan Uçar, Murat Kurt

Medical imaging techniques, especially Magnetic Resonance Imaging (MRI), are accepted as the gold standard in the diagnosis and treatment planning of neurological diseases. However, the manual analysis of MRI images is a time-consuming process for radiologists and is prone to human error due to fatigue. In this study, two different Deep Learning approaches were developed and analyzed comparatively for the automatic detection and classification of brain tumors (Glioma, Meningioma, Pituitary, and No Tumor). In the first approach, a custom Convolutional Neural Network (CNN) architecture named "OkanNet", which has a low computational cost and fast training time, was designed from scratch. In the second approach, the Transfer Learning method was applied using the 50-layer ResNet-50 [1] architecture, pre-trained on the ImageNet dataset. In experiments conducted on an extended dataset compiled by Masoud Nickparvar containing a total of 7,023 MRI images, the Transfer Learning-based ResNet-50 model exhibited superior classification performance, achieving 96.49% Accuracy and 0.963 Precision. In contrast, the custom OkanNet architecture reached an accuracy rate of 88.10%; however, it proved to be a strong alternative for mobile and embedded systems with limited computational power by yielding results approximately 3.2 times faster (311 seconds) than ResNet-50 in terms of training time. This study demonstrates the trade-off between model depth and computational efficiency in medical image analysis through experimental data.

摘要:醫學影像技術,特別是磁共振成像(MRI),被認為是神經疾病診斷和治療計劃的金標準。然而,MRI影像的手動分析對放射科醫生來說是一個耗時的過程,並且由於疲勞容易出現人為錯誤。在這項研究中,開發並比較分析了兩種不同的深度學習方法,用於自動檢測和分類腦腫瘤(膠質瘤、腦膜瘤、腦垂體瘤和無腫瘤)。在第一種方法中,從零開始設計了一種名為“OkanNet”的自定義卷積神經網絡(CNN)架構,其計算成本低且訓練時間快。在第二種方法中,應用了轉移學習方法,使用了在ImageNet數據集上預訓練的50層ResNet-50 [1]架構。在由Masoud Nickparvar編輯的擴展數據集上進行的實驗中,該數據集包含總共$7,023$幅MRI影像,基於轉移學習的ResNet-50模型顯示出優越的分類性能,達到$96.49\%$的準確率和$0.963$的精確度。相比之下,自定義的OkanNet架構達到了$88.10\%$的準確率;然而,它在訓練時間上比ResNet-50快約$3.2$倍($311$秒),證明它對計算能力有限的移動和嵌入式系統是一個強有力的替代方案。這項研究通過實驗數據展示了醫學影像分析中模型深度與計算效率之間的權衡。

BioCOMPASS: Integrating Biomarkers into Transformer-Based Immunotherapy Response Prediction

2604.00739v1 by Sayed Hashim, Frank Soboczenski, Paul Cairns

Datasets used in immunotherapy response prediction are typically small in size, as well as diverse in cancer type, drug administered, and sequencer used. Models often drop in performance when tested on patient cohorts that are not included in the training process. Recent work has shown that transformer-based models combined with self-supervised learning generalise better than threshold-based biomarkers, but their performance is still suboptimal. We present BioCOMPASS, an extension of a transformer-based model called COMPASS, that integrates biomarkers and treatment information to further improve its generalisability. Instead of feeding biomarker data as input, we built loss components to align them with the model's intermediate representations. We found that components such as treatment gating and pathway consistency loss improved generalisability when evaluated with Leave-one-cohort-out, Leave-one-cancer-type-out and Leave-one-treatment-out strategies. Results show that building components that exploit biomarker and treatment information can improve the generalisability of immunotherapy response prediction. Careful curation of additional components that leverage complementary clinical information and domain knowledge represents a promising direction for future research.

摘要:使用於免疫療法反應預測的數據集通常規模較小,且在癌症類型、所施用的藥物和使用的測序儀器上具有多樣性。當在未包含於訓練過程中的患者群體上進行測試時,模型的性能往往會下降。最近的研究顯示,基於Transformer的模型結合自我監督學習在泛化性能上優於基於閾值的生物標記,但仍然不夠理想。我們提出了BioCOMPASS,這是一個基於Transformer的模型COMPASS的擴展,整合了生物標記和治療信息以進一步提高其泛化能力。我們並不是將生物標記數據作為輸入,而是構建了損失組件以使其與模型的中間表示對齊。我們發現,治療閘控和通路一致性損失等組件在使用Leave-one-cohort-out、Leave-one-cancer-type-out和Leave-one-treatment-out策略進行評估時,提高了泛化能力。結果顯示,構建利用生物標記和治療信息的組件可以幫助提高免疫療法反應預測的泛化能力。仔細策劃利用互補臨床信息和領域知識的額外組件,代表了未來研究的一個有前景的方向。

Toward Optimal Sampling Rate Selection and Unbiased Classification for Precise Animal Activity Recognition

2604.00517v1 by Axiu Mao, Meilu Zhu, Lei Shen, Xiaoshuai Wang, Tomas Norton, Kai Liu

With the rapid advancements in deep learning techniques, wearable sensor-aided animal activity recognition (AAR) has demonstrated promising performance, thereby improving livestock management efficiency as well as animal health and welfare monitoring. However, existing research often prioritizes overall performance, overlooking the fact that classification accuracies for specific animal behavioral categories may remain unsatisfactory. This issue typically stems from suboptimal sampling rates or class imbalance problems. To address these challenges and achieve high classification accuracy across all individual behaviors in farm animals, we propose a novel Individual-Behavior-Aware Network (IBA-Net). This network enhances the recognition of each specific behavior by simultaneously customizing features and calibrating the classifier. Specifically, considering that different behaviors require varying sampling rates to achieve optimal performance, we design a Mixture-of-Experts (MoE)-based Feature Customization (MFC) module. This module adaptively fuses data from multiple sampling rates, capturing customized features tailored to various animal behaviors. Additionally, to mitigate classifier bias toward majority classes caused by class imbalance, we develop a Neural Collapse-driven Classifier Calibration (NC3) module. This module introduces a fixed equiangular tight frame (ETF) classifier during the classification stage, maximizing the angles between pair-wise classifier vectors and thereby improving the classification performance for minority classes. To validate the effectiveness of IBA-Net, we conducted experiments on three public datasets covering goat, cattle, and horse activity recognition. The results demonstrate that our method consistently outperforms existing approaches across all datasets.

摘要:隨著深度學習技術的快速進步,穿戴式傳感器輔助的動物活動識別(AAR)顯示出良好的性能,從而提高了畜牧管理效率以及動物健康和福利監測。然而,現有研究通常優先考慮整體性能,忽視了特定動物行為類別的分類準確率可能仍然不令人滿意。這個問題通常源於次優的取樣率或類別不平衡問題。為了解決這些挑戰並在農場動物的所有個體行為中實現高分類準確率,我們提出了一種新穎的個體行為感知網絡(IBA-Net)。該網絡通過同時定制特徵和校準分類器來增強對每種特定行為的識別。具體而言,考慮到不同的行為需要不同的取樣率以實現最佳性能,我們設計了一個基於專家混合(MoE)的特徵定制(MFC)模塊。該模塊自適應地融合來自多個取樣率的數據,捕捉針對各種動物行為量身定制的特徵。此外,為了減輕由於類別不平衡而導致的分類器對多數類別的偏見,我們開發了一個基於神經崩潰的分類器校準(NC3)模塊。該模塊在分類階段引入了一個固定的等角緊框(ETF)分類器,最大化成對分類器向量之間的角度,從而提高對少數類別的分類性能。為了驗證IBA-Net的有效性,我們在涵蓋山羊、牛和馬活動識別的三個公共數據集上進行了實驗。結果表明,我們的方法在所有數據集上始終優於現有方法。
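
The fixed equiangular tight frame (ETF) classifier used by the NC3 module has a closed-form construction. For K classes, the standard simplex ETF takes class vector k to be sqrt(K/(K-1)) * (e_k - (1/K) * 1), which gives unit-norm vectors whose pairwise cosine is -1/(K-1), the largest equal separation possible. The sketch below builds that frame in K dimensions; using the identity as the rotation and the function names are assumptions, since the abstract does not describe how the frame is embedded in the feature space.

```python
import math


def simplex_etf(num_classes):
    """Return K unit vectors in R^K forming a simplex equiangular tight frame."""
    k = num_classes
    if k < 2:
        raise ValueError("an ETF classifier needs at least two classes")
    scale = math.sqrt(k / (k - 1))
    # Row j is scale * (e_j - mean of all basis vectors).
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for i in range(k)]
        for j in range(k)
    ]


def cosine(u, v):
    dot_uv = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot_uv / (norm_u * norm_v)


# Hypothetical 4-behavior problem (e.g. four activity classes).
classifier = simplex_etf(4)
pairwise = [cosine(classifier[i], classifier[j]) for i in range(4) for j in range(i + 1, 4)]
```

Because the frame is fixed rather than learned, majority classes cannot pull the classifier vectors toward themselves during training, which is the bias the abstract aims to remove.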

MAESIL: Masked Autoencoder for Enhanced Self-supervised Medical Image Learning

2604.00514v1 by Kyeonghun Kim, Hyeonseok Jung, Youngung Han, Junsu Lim, YeonJu Jean, Seongbin Park, Eunseob Choi, Hyunsu Go, SeoYoung Ju, Seohyoung Park, Gyeongmin Kim, MinJu Kwon, KyungSeok Yuh, Soo Yong Kim, Ken Ying-Kai Liao, Nam-Joon Kim, Hyuk-Jae Lee

Training deep learning models for three-dimensional (3D) medical imaging, such as Computed Tomography (CT), is fundamentally challenged by the scarcity of labeled data. While pre-training on natural images is common, it results in a significant domain shift, limiting performance. Self-Supervised Learning (SSL) on unlabeled medical data has emerged as a powerful solution, but prominent frameworks often fail to exploit the inherent 3D nature of CT scans. These methods typically process 3D scans as a collection of independent 2D slices, an approach that fundamentally discards critical axial coherence and the 3D structural context. To address this limitation, we propose the Masked Autoencoder for Enhanced Self-supervised Medical Image Learning (MAESIL), a novel self-supervised learning framework designed to capture 3D structural information efficiently. The core innovation is the 'superpatch', a 3D chunk-based input unit that balances 3D context preservation with computational efficiency. Our framework partitions the volume into superpatches and employs a 3D masked autoencoder with a dual-masking strategy to learn comprehensive spatial representations. We validated our approach on three diverse large-scale public CT datasets. Our experimental results show that MAESIL achieves significant improvements over existing methods such as AE, VAE and VQ-VAE in key reconstruction metrics such as PSNR and SSIM. This establishes MAESIL as a robust and practical pre-training solution for 3D medical imaging tasks.

摘要:訓練三維 (3D) 醫學影像的深度學習模型,例如計算機斷層掃描 (CT),基本上受到標記數據稀缺的挑戰。雖然在自然影像上進行預訓練是常見的做法,但這會導致顯著的領域轉移,限制了性能。對未標記醫學數據進行自我監督學習 (SSL) 已經成為一種強大的解決方案,但現有的主要框架往往未能充分利用 CT 掃描的固有 3D 特性。這些方法通常將 3D 掃描處理為獨立的 2D 切片集合,這種方法基本上忽略了關鍵的軸向一致性和 3D 結構上下文。為了解決這一限制,我們提出了增強自我監督醫學影像學習的自編碼器 (MAESIL),這是一個旨在高效捕捉 3D 結構信息的新型自我監督學習框架。核心創新是「超補丁」,這是一種基於 3D 塊的輸入單元,平衡了 3D 上下文的保留與計算效率。我們的框架將體積劃分為超補丁,並採用 3D 掩碼自編碼器策略,結合雙重掩碼策略來學習全面的空間表示。我們在三個多樣化的大型公共 CT 數據集上驗證了我們的方法。我們的實驗結果顯示,MAESIL 在關鍵重建指標如 PSNR 和 SSIM 上顯示出相較於現有方法如 AE、VAE 和 VQ-VAE 的顯著改進。這使 MAESIL 成為 3D 醫學影像任務的穩健且實用的預訓練解決方案。
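
The superpatch idea in the MAESIL abstract (cut the CT volume into 3D chunks, then hide a subset for the autoencoder to reconstruct) can be sketched with index arithmetic alone. The chunk size, the 75% mask ratio, and the single uniform random mask below are assumptions; the abstract mentions a dual-masking strategy but does not specify it, so it is not reproduced here.

```python
import random
from itertools import product


def superpatch_origins(volume_shape, patch_shape):
    """Corner coordinates (z, y, x) of non-overlapping 3D chunks covering the volume."""
    for size, step in zip(volume_shape, patch_shape):
        if size % step != 0:
            raise ValueError("volume must be divisible by the superpatch size")
    ranges = [range(0, size, step) for size, step in zip(volume_shape, patch_shape)]
    return list(product(*ranges))


def random_mask(num_patches, mask_ratio, seed=0):
    """Boolean list marking which superpatches are hidden from the encoder."""
    rng = random.Random(seed)
    num_masked = round(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [index in masked for index in range(num_patches)]


# Hypothetical 4x4x4 volume split into 2x2x2 superpatches.
origins = superpatch_origins((4, 4, 4), (2, 2, 2))
mask = random_mask(len(origins), mask_ratio=0.75)
visible = [origin for origin, hidden in zip(origins, mask) if not hidden]
```

Each origin indexes a chunk that spans several slices, so neighbouring axial context stays inside one input unit instead of being split across independent 2D slices.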

A Reasoning-Enabled Vision-Language Foundation Model for Chest X-ray Interpretation

2604.00493v1 by Yabin Zhang, Chong Wang, Yunhe Gao, Jiaming Liu, Maya Varma, Justin Xu, Sophie Ostmeier, Jin Long, Sergios Gatidis, Seena Dehkharghani, Arne Michalson, Eun Kyoung Hong, Christian Bluethgen, Haiwei Henry Guo, Alexander Victor Ortiz, Stephan Altmayer, Sandhya Bodapati, Joseph David Janizek, Ken Chang, Jean-Benoit Delbrouck, Akshay S. Chaudhari, Curtis P. Langlotz

Chest X-rays (CXRs) are among the most frequently performed imaging examinations worldwide, yet rising imaging volumes increase radiologist workload and the risk of diagnostic errors. Although artificial intelligence (AI) systems have shown promise for CXR interpretation, most generate only final predictions, without making explicit how visual evidence is translated into radiographic findings and diagnostic predictions. We present CheXOne, a reasoning-enabled vision-language model for CXR interpretation. CheXOne jointly generates diagnostic predictions and explicit, clinically grounded reasoning traces that connect visual evidence, radiographic findings, and these predictions. The model is trained on 14.7 million instruction and reasoning samples curated from 30 public datasets spanning 36 CXR interpretation tasks, using a two-stage framework that combines instruction tuning with reinforcement learning to improve reasoning quality. We evaluate CheXOne in zero-shot settings across visual question answering, report generation, visual grounding and reasoning assessment, covering 17 evaluation settings. CheXOne outperforms existing medical and general-domain foundation models and achieves strong performance on independent public benchmarks. A clinical reader study demonstrates that CheXOne-drafted reports are comparable to or better than resident-written reports in 55% of cases, while effectively addressing clinical indications and enhancing both report writing and CXR interpretation efficiency. Further analyses involving radiologists reveal that the generated reasoning traces show high clinical factuality and provide causal support for the final predictions, offering a plausible explanation for the performance gains. These results suggest that explicit reasoning can improve model performance, interpretability and clinical utility in AI-assisted CXR interpretation.

摘要:胸部X光檢查(CXRs)是全球最常執行的影像檢查之一,但不斷增加的影像量增加了放射科醫師的工作負擔和診斷錯誤的風險。儘管人工智慧(AI)系統在CXR解讀方面顯示出潛力,但大多數僅生成最終預測,而未明確說明視覺證據如何轉化為放射學發現和診斷預測。我們提出了CheXOne,一種具備推理能力的視覺-語言模型,用於CXR解讀。CheXOne共同生成診斷預測和明確的、臨床基礎的推理痕跡,將視覺證據、放射學發現和這些預測連結起來。該模型在從30個公共數據集中策劃的1470萬個指令和推理樣本上進行訓練,涵蓋36個CXR解讀任務,使用一種結合指令調整和強化學習的兩階段框架,以提高推理質量。我們在零樣本設置下評估CheXOne,涵蓋視覺問題回答、報告生成、視覺定位和推理評估,共涉及17個評估設置。CheXOne在現有的醫學和通用領域基礎模型中表現優於其他模型,並在獨立公共基準上取得了良好的表現。一項臨床讀者研究顯示,CheXOne撰寫的報告在55%的案例中與住院醫師撰寫的報告相當或更好,同時有效地解決臨床指徵,並提高報告撰寫和CXR解讀的效率。進一步的分析涉及放射科醫師,顯示生成的推理痕跡具有高臨床事實性,並為最終預測提供因果支持,為性能提升提供了合理的解釋。這些結果表明,明確的推理可以改善模型性能、可解釋性和在AI輔助CXR解讀中的臨床實用性。

Improving Generalization of Deep Learning for Brain Metastases Segmentation Across Institutions

2604.00397v1 by Yuchen Yang, Shuangyang Zhong, Haijun Yu, Langcuomu Suo, Hongbin Han, Florian Putz, Yixing Huang

Background: Deep learning has demonstrated significant potential for automated brain metastases (BM) segmentation; however, models trained at a single institution often perform poorly at other sites due to differences in scanner hardware, imaging protocols, and patient demographics. This work aims to create a domain adaptation framework that allows BM segmentation to be used across multiple institutions. Methods: We propose a VAE-MMD preprocessing pipeline that combines variational autoencoders (VAE) with maximum mean discrepancy (MMD) loss, incorporating skip connections and self-attention mechanisms alongside nnU-Net segmentation. The method was tested on 740 patients from four public databases (Stanford, UCSF, UCLM, and PKG) and evaluated by domain-classifier accuracy, sensitivity, precision, F1/F2 scores, surface Dice (sDice), and 95th percentile Hausdorff distance (HD95). Results: VAE-MMD reduced domain classifier accuracy from 0.91 to 0.50, indicating successful feature alignment across institutions. Reconstructed volumes attained a PSNR greater than 36 dB, maintaining anatomical accuracy. The combined method raised the mean F1 by 11.1% (0.700 to 0.778), the mean sDice by 7.93% (0.7121 to 0.7686), and reduced the mean HD95 by 65.5% (11.33 to 3.91 mm) across all four centers compared to the baseline nnU-Net. Conclusions: VAE-MMD effectively diminishes cross-institutional data heterogeneity and enhances BM segmentation generalization across volumetric, detection, and boundary-level metrics without necessitating target-domain labels, thereby overcoming a significant obstacle to the clinical implementation of AI-assisted segmentation.

摘要:背景:深度學習在自動化腦轉移(BM)分割方面顯示出顯著潛力;然而,在單一機構訓練的模型在不同地點的表現往往不佳,這是由於掃描儀硬體、影像協議和患者人口統計的差異。本研究的目標是創建一個領域適應框架,使得BM分割能夠在多個機構之間使用。
方法:我們提出了一個VAE-MMD預處理管道,該管道將變分自編碼器(VAE)與最大均值差異(MMD)損失結合,並在nnU-Net分割的基礎上融入跳躍連接和自注意力機制。該方法在來自四個公共數據庫的740名患者上進行了測試:斯坦福大學、加州大學舊金山分校、馬德里大學和PKG,通過領域分類器的準確性、靈敏度、精確度、F1/F2分數、表面Dice(sDice)和第95百分位Hausdorff距離(HD95)進行評估。
結果:VAE-MMD將領域分類器的準確性從0.91降低到0.50,表明成功實現了跨機構的特徵對齊。重建的體積達到了大於36 dB的PSNR,保持了解剖準確性。該綜合方法使得平均F1提高了11.1%(從0.700到0.778),平均sDice提高了7.93%(從0.7121到0.7686),並且在四個中心之間相比基線nnU-Net將平均HD95降低了65.5%(從11.33到3.91 mm)。
結論:VAE-MMD有效減少了跨機構數據的異質性,並增強了BM分割在體積、檢測和邊界級別指標上的泛化能力,而無需目標領域標籤,從而克服了臨床實施AI輔助分割的一個重大障礙。
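
The abstract does not spell out how the MMD term is computed. As a rough illustration only, the sketch below estimates squared MMD between two sets of feature vectors with a Gaussian kernel. The kernel choice, the bandwidth, the function names, and the toy "site" data are all assumptions made for this example, not the authors' implementation.

```python
import math
import random


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    def mean_kernel(ps, qs):
        total = sum(rbf_kernel(p, q, bandwidth) for p in ps for q in qs)
        return total / (len(ps) * len(qs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical 2-D "features" from three acquisitions (toy data only).
rng = random.Random(0)
site_a = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(100)]
site_b = [[rng.gauss(2.0, 1.0) for _ in range(2)] for _ in range(100)]   # shifted
site_a2 = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(100)]  # same law as site_a

mmd_shifted = mmd_squared(site_a, site_b)
mmd_matched = mmd_squared(site_a, site_a2)
print(f"shifted sites: {mmd_shifted:.3f}, matched sites: {mmd_matched:.3f}")
```

In this toy setting the shifted pair yields a clearly larger value than the matched pair; a training loss that penalizes this quantity pushes features from different sites toward a common distribution.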

EvolveTool-Bench: Evaluating the Quality of LLM-Generated Tool Libraries as Software Artifacts

2604.00392v1 by Alibek T. Kaliyev, Artem Maryanskyy

Modern LLM agents increasingly create their own tools at runtime -- from Python functions to API clients -- yet existing benchmarks evaluate them almost exclusively by downstream task completion. This is analogous to judging a software engineer only by whether their code runs, ignoring redundancy, regression, and safety. We introduce EvolveTool-Bench, a diagnostic benchmark for LLM-generated tool libraries in software engineering workflows. Across three domains requiring actual tool execution (proprietary data formats, API orchestration, and numerical computation), we define library-level software quality metrics -- reuse, redundancy, composition success, regression stability, and safety -- alongside a per-tool Tool Quality Score measuring correctness, robustness, generality, and code quality. In the first head-to-head comparison of code-level and strategy-level tool evolution (ARISE vs. EvoSkill vs. one-shot baselines, 99 tasks, two models), we show that systems with similar task completion (63-68%) differ by up to 18% in library health, revealing software quality risks invisible to task-only evaluation. Our results highlight that evaluation and governance of LLM-generated tools require treating the evolving tool library as a first-class software artifact, not a black box.

摘要:現代的 LLM 代理越來越多地在運行時創建自己的工具——從 Python 函數到 API 客戶端——然而現有的基準幾乎僅通過下游任務的完成來評估它們。這類似於僅根據一名軟體工程師的程式碼是否能運行來評價他們,卻忽略了冗餘、回歸和安全性。我們引入了 EvolveTool-Bench,這是一個用於 LLM 生成的工具庫在軟體工程工作流程中的診斷基準。在三個需要實際工具執行的領域(專有數據格式、API 協調和數值計算)中,我們定義了庫級軟體質量指標——重用、冗餘、組合成功、回歸穩定性和安全性——以及每個工具的工具質量分數,該分數衡量正確性、穩健性、通用性和程式碼質量。在首次的代碼級和策略級工具演化的正面比較中(ARISE vs. EvoSkill vs. 一次性基準,99 個任務,兩個模型),我們顯示出具有相似任務完成率(63-68%)的系統在庫健康度上最多相差 18%,揭示了僅依賴任務評估而無法察覺的軟體質量風險。我們的結果強調,對 LLM 生成工具的評估和治理需要將不斷演變的工具庫視為一級軟體工件,而不是黑箱。

Collaborative AI Agents and Critics for Fault Detection and Cause Analysis in Network Telemetry

2604.00319v1 by Syed Eqbal Alam, Zhan Shu

We develop algorithms for collaborative control of AI agents and critics in a multi-actor, multi-critic federated multi-agent system. Each AI agent and critic has access to classical machine learning or generative AI foundation models. The AI agents and critics collaborate with a central server to complete multimodal tasks such as fault detection, severity, and cause analysis in a network telemetry system, text-to-image generation, video generation, and healthcare diagnostics from medical images and patient records. The AI agents complete their tasks and send them to AI critics for evaluation. The critics then send feedback to agents to improve their responses. Collaboratively, they minimize the overall cost to the system with no inter-agent or inter-critic communication. AI agents and critics keep their cost functions or derivatives of cost functions private. Using multi-time scale stochastic approximation techniques, we provide convergence guarantees on the time-average active states of AI agents and critics. The communication overhead on the system is small, of the order of $\mathcal{O}(m)$ for $m$ modalities, and is independent of the number of AI agents and critics. Finally, we present an example of fault detection, severity, and cause analysis in network telemetry, together with a thorough evaluation of the algorithm's efficacy.

摘要:我們為多演員、多評論者的聯邦多智能體系統開發了協作控制的算法。每個AI智能體和評論者都可以訪問傳統機器學習或生成AI基礎模型。AI智能體和評論者與中央伺服器合作,以完成多模態任務,例如在網絡遙測系統中的故障檢測、嚴重性和原因分析、文本到圖像生成、視頻生成、從醫療影像和病歷進行的醫療診斷等。AI智能體完成其任務並將其發送給AI評論者進行評估。評論者然後向智能體發送反饋,以改善其回應。他們協同工作,最小化系統的整體成本,且不進行智能體或評論者之間的通信。AI智能體和評論者將其成本函數或成本函數的導數保持私密。使用多時間尺度隨機近似技術,我們提供了AI智能體和評論者的時間平均活動狀態的收斂保證。通信開銷對系統的影響較小,約為$\mathcal{O}(m)$,其中$m$為模態數,且與AI智能體和評論者的數量無關。最後,我們展示了一個在網絡遙測中進行故障檢測、嚴重性和原因分析的例子,以及徹底的評估以檢查算法的有效性。

SANA I2I: A Text Free Flow Matching Framework for Paired Image to Image Translation with a Case Study in Fetal MRI Artifact Reduction

2604.00298v1 by Italo Felix Santos, Gilson Antonio Giraldi, Heron Werner Junior

We propose SANA-I2I, a text-free high-resolution image-to-image generation framework that extends the SANA family by removing textual conditioning entirely. In contrast to SanaControlNet, which combines text and image-based control, SANA-I2I relies exclusively on paired source-target images to learn a conditional flow-matching model in latent space. The model learns a conditional velocity field that maps a target image distribution to another one, enabling supervised image translation without reliance on language prompts. We evaluate the proposed approach on the challenging task of fetal MRI motion artifact reduction. To enable paired training in this application, where real paired data are difficult to acquire, we adopt a synthetic data generation strategy based on the method proposed by Duffy et al., which simulates realistic motion artifacts in fetal magnetic resonance imaging (MRI). Experimental results demonstrate that SANA-I2I effectively suppresses motion artifacts while preserving anatomical structure, achieving competitive performance in few inference steps. These results highlight the efficiency and suitability of our proposed flow-based, text-free generative models for supervised image-to-image tasks in medical imaging.

摘要:我們提出了 SANA-I2I,一個無文本的高解析度圖像到圖像生成框架,通過完全移除文本條件來擴展 SANA 家族。與結合文本和基於圖像控制的 SanaControlNet 相比,SANA-I2I 僅依賴配對的源目標圖像來學習潛在空間中的條件流匹配模型。該模型學習一個條件速度場,將目標圖像分佈映射到另一個圖像,實現無需依賴語言提示的監督圖像轉換。我們在挑戰性的胎兒 MRI 運動伪影減少任務上評估了所提出的方法。為了在這一應用中實現配對訓練,因為真實配對數據難以獲得,我們採用了基於 Duffy 等人提出的方法的合成數據生成策略,該方法模擬胎兒磁共振成像 (MRI) 中的真實運動伪影。實驗結果表明,SANA-I2I 有效抑制運動伪影,同時保留解剖結構,實現了在少量推斷步驟下的競爭性能。這些結果突顯了我們提出的基於流的無文本生成模型在醫學影像中進行監督圖像到圖像任務的效率和適用性。

A Safety-Aware Role-Orchestrated Multi-Agent LLM Framework for Behavioral Health Communication Simulation

2604.00249v1 by Ha Na Cho

Single-agent large language model (LLM) systems struggle to simultaneously support diverse conversational functions and maintain safety in behavioral health communication. We propose a safety-aware, role-orchestrated multi-agent LLM framework designed to simulate supportive behavioral health dialogue through coordinated, role-differentiated agents. Conversational responsibilities are decomposed across specialized agents, including empathy-focused, action-oriented, and supervisory roles, while a prompt-based controller dynamically activates relevant agents and enforces continuous safety auditing. Using semi-structured interview transcripts from the DAIC-WOZ corpus, we evaluate the framework with scalable proxy metrics capturing structural quality, functional diversity, and computational characteristics. Results illustrate clear role differentiation, coherent inter-agent coordination, and predictable trade-offs between modular orchestration, safety oversight, and response latency when compared to a single-agent baseline. This work emphasizes system design, interpretability, and safety, positioning the framework as a simulation and analysis tool for behavioral health informatics and decision-support research rather than a clinical intervention.

摘要:單一代理的大型語言模型(LLM)系統在同時支持多樣化的對話功能和維持行為健康溝通的安全性方面面臨挑戰。我們提出了一種安全意識、角色協調的多代理LLM框架,旨在通過協調的、角色區分的代理模擬支持性的行為健康對話。對話責任被分解到專門的代理中,包括以同理心為重點、以行動為導向和監督角色,而基於提示的控制器則動態激活相關代理並執行持續的安全審核。使用來自DAIC-WOZ語料庫的半結構化訪談轉錄,我們利用可擴展的代理指標評估該框架,捕捉結構質量、功能多樣性和計算特徵。結果顯示出明確的角色區分、一致的代理間協調,以及與單一代理基準相比,模組化協調、安全監督和響應延遲之間的可預測權衡。這項工作強調系統設計、可解釋性和安全性,將該框架定位為行為健康信息學和決策支持研究的模擬和分析工具,而非臨床干預。

One Panel Does Not Fit All: Case-Adaptive Multi-Agent Deliberation for Clinical Prediction

2604.00085v1 by Yuxing Lu, Yushuhong Lin, Jason Zhang

Large language models applied to clinical prediction exhibit case-level heterogeneity: simple cases yield consistent outputs, while complex cases produce divergent predictions under minor prompt changes. Existing single-agent strategies sample from one role-conditioned distribution, and multi-agent frameworks use fixed roles with flat majority voting, discarding the diagnostic signal in disagreement. We propose CAMP (Case-Adaptive Multi-agent Panel), where an attending-physician agent dynamically assembles a specialist panel tailored to each case's diagnostic uncertainty. Each specialist evaluates candidates via three-valued voting (KEEP/REFUSE/NEUTRAL), enabling principled abstention outside one's expertise. A hybrid router directs each diagnosis through strong consensus, fallback to the attending physician's judgment, or evidence-based arbitration that weighs argument quality over vote counts. On diagnostic prediction and brief hospital course generation from MIMIC-IV across four LLM backbones, CAMP consistently outperforms strong baselines while consuming fewer tokens than most competing multi-agent methods, with voting records and arbitration traces offering transparent decision audits.

摘要:大型語言模型應用於臨床預測顯示出案例層級的異質性:簡單案例產生一致的輸出,而複雜案例在輕微提示變更下產生不同的預測。現有的單一代理策略從一個角色條件分布中進行抽樣,而多代理框架使用固定角色和平坦的多數投票,忽略了不一致中的診斷信號。我們提出了 CAMP(案例自適應多代理小組),其中一名主治醫生代理根據每個案例的診斷不確定性動態組建專家小組。每位專家通過三值投票(KEEP/REFUSE/NEUTRAL)評估候選人,使得在自己專業範疇之外能夠有原則地選擇不投票。一個混合路由器將每個診斷引導通過強共識、回退到主治醫生的判斷,或根據證據的仲裁,權衡論點質量而非投票數量。在 MIMIC-IV 上進行的診斷預測和簡短住院過程生成中,CAMP 在四個 LLM 主幹上始終超越強基準,同時消耗的標記數量少於大多數競爭的多代理方法,投票記錄和仲裁痕跡提供了透明的決策審計。
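
To make the three routing outcomes concrete, here is a minimal sketch of how three-valued votes could be mapped to consensus, attending fallback, or arbitration. The abstract gives no thresholds or rule details, so the function name `route_diagnosis`, the parameters `consensus_threshold` and `min_informative`, and the returned path labels are hypothetical choices for illustration, not CAMP's actual router.

```python
from collections import Counter


def route_diagnosis(votes, consensus_threshold=0.8, min_informative=2):
    """Choose a routing path from KEEP / REFUSE / NEUTRAL specialist votes.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    counts = Counter(votes)
    # NEUTRAL is treated as abstention, so it carries no weight either way.
    informative = counts["KEEP"] + counts["REFUSE"]
    if informative < min_informative:
        # Too few specialists took a position: defer to the attending agent.
        return "attending_fallback"
    top_share = max(counts["KEEP"], counts["REFUSE"]) / informative
    if top_share >= consensus_threshold:
        if counts["KEEP"] > counts["REFUSE"]:
            return "consensus_keep"
        return "consensus_refuse"
    # Specialists disagree: hand over to argument-based arbitration.
    return "arbitration"


print(route_diagnosis(["KEEP", "KEEP", "KEEP", "NEUTRAL"]))
print(route_diagnosis(["NEUTRAL", "NEUTRAL", "KEEP"]))
print(route_diagnosis(["KEEP", "REFUSE", "KEEP", "REFUSE"]))
```

The point of the sketch is that abstentions are excluded from the denominator, so a specialist voting NEUTRAL outside their expertise neither blocks nor manufactures a consensus.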

Physiological and Semantic Patterns in Medical Teams Using an Intelligent Tutoring System

2603.29950v1 by Xiaoshan Huang, Conrad Borchers, Jiayi Zhang, Susanne P. Lajoie

Effective collaboration requires teams to manage complex cognitive and emotional states through Socially Shared Regulation of Learning (SSRL). Physiological synchrony (i.e., longitudinal alignment in physiological signals) can indicate these states, but is hard to interpret on its own. We investigate the physiological and conversational dynamics of four medical dyads diagnosing a virtual patient case using an intelligent tutoring system. Semantic shifts in dialogue were correlated with transient physiological synchrony peaks. We also coded utterance segments for SSRL and derived cosine similarity using sentence embeddings. The results showed that activating prior knowledge featured significantly lower semantic similarity than simpler task execution. High physiological synchrony was associated with lower semantic similarity, suggesting that such moments involve exploratory and varied language use. Qualitative analysis triangulated these synchrony peaks as "pivotal moments": successful teams synchronized during shared discovery, while unsuccessful teams peaked during shared uncertainty. This research advances human-centered AI by demonstrating how biological signals can be fused with dialogues to understand critical moments in problem solving.

摘要:有效的合作需要團隊通過社會共享學習調節(SSRL)來管理複雜的認知和情感狀態。生理同步(即生理信號的縱向對齊)可以指示這些狀態,但單獨解釋起來較為困難。我們研究了四個醫療二人組在使用智能輔導系統診斷虛擬病人案例時的生理和對話動態。對話中的語義變化與瞬時生理同步峰值相關聯。我們還對發言片段進行了SSRL編碼,並使用句子嵌入推導了餘弦相似度。結果顯示,激活先前知識的語義相似度顯著低於較簡單的任務執行。高生理同步與較低的語義相似度相關,這表明這樣的時刻涉及探索性和多樣化的語言使用。質性分析將這些同步峰值三角測量為“關鍵時刻”:成功的團隊在共享發現時同步,而不成功的團隊在共享不確定性時達到峰值。本研究通過展示如何將生物信號與對話融合,以理解問題解決中的關鍵時刻,推進了以人為中心的人工智慧。
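
The semantic-similarity measure mentioned above is cosine similarity between sentence embeddings. A minimal sketch of that computation follows; the three short vectors stand in for real sentence embeddings and are made up for this example, as are the function names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def consecutive_similarities(embeddings):
    """Similarity between each utterance embedding and the next one."""
    return [
        cosine_similarity(current, following)
        for current, following in zip(embeddings, embeddings[1:])
    ]


# Hypothetical 3-D "embeddings" for three consecutive utterances.
utterances = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],   # close in meaning to the first
    [0.0, 0.2, 1.0],   # a shift to a different topic
]
print(consecutive_similarities(utterances))
```

A drop in similarity between neighbouring utterances is one simple way to flag a semantic shift of the kind the study relates to synchrony peaks.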

Four Generations of Quantum Biomedical Sensors

2603.29944v2 by Xin Jin, Priyam Srivastava, Ronghe Wang, Yuqing Li, Jonathan Beaumariage, Tom Purdy, M. V. Gurudev Dutt, Kang Kim, Kaushik Seshadreesan, Junyu Liu

Quantum sensing technologies offer transformative potential for ultra-sensitive biomedical sensing, yet their clinical translation remains constrained by classical noise limits and a reliance on macroscopic ensembles. We propose a unifying generational framework to organize the evolving landscape of quantum biosensors based on their utilization of quantum resources. First-generation devices utilize discrete energy levels for signal transduction but follow classical scaling laws. Second-generation sensors exploit quantum coherence to reach the standard quantum limit, while third-generation architectures leverage entanglement and spin squeezing to approach Heisenberg-limited precision. We further define an emerging fourth generation characterized by the end-to-end integration of quantum sensing with quantum learning and variational circuits, enabling adaptive inference directly within the quantum domain. By analyzing critical parameters such as bandwidth matching and sensor-tissue proximity, we identify key technological bottlenecks and propose a roadmap for transitioning from measuring physical observables to extracting structured biological information with quantum-enhanced intelligence.

摘要:量子感測技術為超敏感生物醫學感測提供了變革性的潛力,然而其臨床轉化仍受到經典噪聲限制和依賴宏觀集合的制約。我們提出了一個統一的生成框架,以組織基於量子資源利用的量子生物感測器的演變格局。第一代設備利用離散能級進行信號轉導,但遵循經典縮放法則。第二代感測器利用量子相干性達到標準量子極限,而第三代架構則利用糾纏和自旋擠壓接近海森堡限制的精度。我們進一步定義了一個新興的第四代,其特徵是量子感測與量子學習和變分電路的端到端整合,實現了在量子領域內直接進行自適應推斷。通過分析帶寬匹配和感測器與組織的接近等關鍵參數,我們識別出主要技術瓶頸,並提出了一個從測量物理可觀察量到提取結構化生物信息的量子增強智能的過渡路線圖。
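
For reference, the two limits named in the abstract are commonly stated as scalings of the estimation uncertainty with the number $N$ of probes. These are the standard textbook forms, given here as background rather than taken from the paper:

```latex
\Delta\theta_{\mathrm{SQL}} \sim \frac{1}{\sqrt{N}},
\qquad
\Delta\theta_{\mathrm{HL}} \sim \frac{1}{N}
```

With $N = 100$, for example, the standard quantum limit gives an uncertainty of order $0.1$ and the Heisenberg limit one of order $0.01$, a tenfold gap that grows as $\sqrt{N}$.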

ScoringBench: A Benchmark for Evaluating Tabular Foundation Models with Proper Scoring Rules

2603.29928v1 by Jonas Landsgesell, Pascal Knoll

Tabular foundation models such as TabPFN and TabICL already produce full predictive distributions, yet prevailing regression benchmarks evaluate them almost exclusively via point-estimate metrics (RMSE, R2). These aggregate measures often obscure model performance in the tails of the distribution, a critical deficit for high-stakes decision-making in domains like finance and clinical research, where asymmetric risk profiles are the norm. We introduce ScoringBench, an open benchmark that computes a comprehensive suite of proper scoring rules, such as CRPS, CRLS, Interval Score, Energy Score, weighted CRPS, and Brier Score, alongside standard point metrics, providing a richer picture of probabilistic forecast quality. We evaluate realTabPFNv2.5 fine-tuned with different scoring-rule objectives, and TabICL, relative to untuned realTabPFNv2.5 across a suite of regression benchmarks. Our results confirm that model rankings depend on the chosen scoring rule and that no single pretraining objective is universally optimal. This demonstrates that, for applications sensitive to extreme events, the choice of evaluation metric is as much a domain-specific requirement as the data itself. ScoringBench is available at https://github.com/jonaslandsgesell/ScoringBench. A live preview of the current leaderboard is available at https://scoringbench.bolt.host. The leaderboard is maintained via git pull requests to ensure transparency, traceability, agility, and reproducibility.

摘要:表格基礎模型如 TabPFN 和 TabICL 已經能產生完整的預測分佈,但目前的回歸基準幾乎僅通過點估計指標 RMSE 和 R2 來評估它們。這些綜合指標往往掩蓋了模型在分佈尾部的表現,這對於金融和臨床研究等高風險決策領域來說是一個關鍵的缺陷,因為不對稱風險輪廓是常態。我們介紹了 ScoringBench,一個開放的基準,計算一套全面的適當評分規則,如 CRPS、CRLS、區間分數、能量分數、加權 CRPS 和 Brier 分數,並與標準點指標一起提供更豐富的概率預測質量圖景。我們評估了經過不同評分規則目標微調的 realTabPFNv2.5 和未經調整的 realTabPFNv2.5 相對於一系列回歸基準的 TabICL。我們的結果確認模型排名取決於所選的評分規則,且沒有單一的預訓練目標是普遍最佳的。這表明,對於對極端事件敏感的應用,評估指標的選擇與數據本身一樣,是一個特定於領域的要求。ScoringBench 可在 https://github.com/jonaslandsgesell/ScoringBench 獲得。當前排行榜的實時預覽可在 https://scoringbench.bolt.host 獲得。排行榜通過 git pull 請求進行維護,以確保透明度、可追溯性、靈活性和可重現性。
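
To illustrate why a proper scoring rule can separate forecasts that a point metric cannot, the sketch below computes a sample-based CRPS estimate using the standard identity CRPS = E|X - y| - 0.5 E|X - X'|. It is a generic estimator written for this example, not ScoringBench's code, and the two toy forecasts are invented.

```python
def crps_from_samples(samples, observed):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one forecast sample")
    accuracy_term = sum(abs(x - observed) for x in samples) / n
    spread_term = sum(abs(a - b) for a in samples for b in samples) / (n * n)
    return accuracy_term - 0.5 * spread_term


def mean(values):
    return sum(values) / len(values)


observed = 1.0
sharp_forecast = [0.9, 1.1]    # hypothetical narrow predictive samples
wide_forecast = [-1.0, 3.0]    # hypothetical wide predictive samples

# Both forecasts have the same mean, so a point metric sees no difference.
sharp_point_error = abs(mean(sharp_forecast) - observed)
wide_point_error = abs(mean(wide_forecast) - observed)

# CRPS penalizes the unnecessarily wide forecast.
sharp_crps = crps_from_samples(sharp_forecast, observed)
wide_crps = crps_from_samples(wide_forecast, observed)
print(sharp_point_error, wide_point_error, sharp_crps, wide_crps)
```

Lower CRPS is better, so the narrow forecast wins here even though both have essentially zero point error.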

Training deep learning based dynamic MR image reconstruction using synthetic fractals

2603.29922v1 by Anirudh Raman, Olivier Jaubert, Mark Wrobel, Tina Yao, Ruaraidh Campbell, Rebecca Baker, Ruta Virsinskaite, Daniel Knight, Michael Quail, Jennifer Steeden, Vivek Muthurangu

Purpose: To investigate whether synthetically generated fractal data can be used to train deep learning (DL) models for dynamic MRI reconstruction, thereby avoiding the privacy, licensing, and availability limitations associated with cardiac MR training datasets. Methods: A training dataset was generated using quaternion Julia fractals to produce 2D+time images. Multi-coil MRI acquisition was simulated to generate paired fully sampled and radially undersampled k-space data. A 3D UNet deep artefact suppression model was trained using these fractal data (F-DL) and compared with an identical model trained on cardiac MRI data (CMR-DL). Both models were evaluated on prospectively acquired radial real-time cardiac MRI from 10 patients. Reconstructions were compared against compressed sensing (CS) and low-rank deep image prior (LR-DIP). All reconstructions were ranked for image quality, while ventricular volumes and ejection fraction were compared with reference breath-hold cine MRI. Results: There was no significant difference in qualitative ranking between F-DL and CMR-DL (p=0.9), while both outperformed CS and LR-DIP (p<0.001). Ventricular volumes and function derived from F-DL were similar to CMR-DL, showing no significant bias and acceptable limits of agreement compared to reference cine imaging. However, LR-DIP had a significant bias (p=0.016) and wider limits of agreement. Conclusion: DL models trained using synthetic fractal data can reconstruct real-time cardiac MRI with image quality and clinical measurements comparable to models trained on true cardiac MRI data. Fractal training data provide an open, scalable alternative to clinical datasets and may enable development of more generalisable DL reconstruction models for dynamic MRI.

摘要:目的:調查合成生成的分形數據是否可以用於訓練深度學習(DL)模型以進行動態MRI重建,從而避免與心臟MRI訓練數據集相關的隱私、授權和可用性限制。方法:使用四元數Julia分形生成訓練數據集,以產生2D+時間影像。模擬多線圈MRI獲取,以生成配對的完全採樣和徑向欠採樣的k空間數據。使用這些分形數據(F-DL)訓練了一個3D UNet深度伪影抑制模型,並與在心臟MRI數據(CMR-DL)上訓練的相同模型進行比較。這兩個模型在10名患者的前瞻性獲取的徑向實時心臟MRI上進行評估。重建結果與壓縮感知(CS)和低秩深度影像先驗(LR-DIP)進行比較。所有重建結果根據影像質量進行排名,而心室體積和射血分數則與參考的屏息動態MRI進行比較。結果:F-DL和CMR-DL之間的質量排名沒有顯著差異(p=0.9),而且兩者均優於CS和LR-DIP(p<0.001)。從F-DL衍生的心室體積和功能與CMR-DL相似,顯示出與參考動態影像相比沒有顯著偏差和可接受的協議範圍。然而,LR-DIP有顯著偏差(p=0.016)和更寬的協議範圍。結論:使用合成分形數據訓練的DL模型可以重建實時心臟MRI,其影像質量和臨床測量可與基於真實心臟MRI數據訓練的模型相媲美。分形訓練數據提供了一種開放、可擴展的替代臨床數據集的選擇,並可能促進更具通用性的動態MRI重建DL模型的發展。

Brain MR Image Synthesis with Multi-contrast Self-attention GAN

2604.00070v1 by Zaid A. Abod, Furqan Aziz

Accurate and complete multi-modal Magnetic Resonance Imaging (MRI) is essential for neuro-oncological assessment, as each contrast provides complementary anatomical and pathological information. However, acquiring all modalities (e.g., T1c, T1n, T2, T2f) for every patient is often impractical due to time, cost, and patient discomfort, potentially limiting comprehensive tumour evaluation. We propose 3D-MC-SAGAN (3D Multi-Contrast Self-Attention generative adversarial network), a unified 3D multi-contrast synthesis framework that generates high-fidelity missing modalities from a single T2 input while explicitly preserving tumour characteristics. The model employs a multi-scale 3D encoder-decoder generator with residual connections and a novel Memory-Bounded Hybrid Attention (MBHA) block to capture long-range dependencies efficiently, and is trained with a WGAN-GP critic and an auxiliary contrast-conditioning branch to produce T2f, T1n, and T1c volumes within a single unified network. A frozen 3D U-Net-based segmentation module introduces a segmentation-consistency constraint to preserve lesion morphology. The composite objective integrates adversarial, reconstruction, perceptual, structural similarity, contrast-classification, and segmentation-guided losses to align global realism with tumour-preserving structure. Extensive evaluation on 3D brain MRI datasets demonstrates that 3D-MC-SAGAN achieves state-of-the-art quantitative performance and generates visually coherent, anatomically plausible contrasts with improved distribution-level realism. Moreover, it maintains tumour segmentation accuracy comparable to fully acquired multi-modal inputs, highlighting its potential to reduce acquisition burden while preserving clinically meaningful information.

摘要:準確且完整的多模態磁共振成像(MRI)對於神經腫瘤評估至關重要,因為每種對比提供互補的解剖和病理信息。然而,由於時間、成本和患者不適,為每位患者獲取所有模態(例如,T1c、T1n、T2、T2f)通常是不切實際的,這可能限制了全面的腫瘤評估。我們提出了3D-MC-SAGAN(3D多對比自注意生成對抗網絡),這是一個統一的3D多對比合成框架,能從單一的T2輸入生成高保真度的缺失模態,同時明確保留腫瘤特徵。該模型採用多尺度3D編碼器-解碼器生成器,具有殘差連接和新穎的記憶約束混合注意(MBHA)模塊,以高效捕捉長距離依賴性,並使用WGAN-GP評價器和輔助對比條件分支進行訓練,以在單一統一網絡內生成T2f、T1n和T1c體積。一個凍結的3D U-Net基於分割模塊引入了分割一致性約束,以保留病變形態。綜合目標整合了對抗性、重建、感知、結構相似性、對比分類和分割引導損失,以將全球現實與腫瘤保護結構對齊。在3D腦MRI數據集上的廣泛評估表明,3D-MC-SAGAN實現了最先進的定量性能,並生成視覺上連貫、解剖上合理的對比,具有改善的分佈級現實性。此外,它保持的腫瘤分割準確性可與完全獲取的多模態輸入相媲美,突顯了其在減少獲取負擔的同時保留臨床有意義信息的潛力。

Symphony for Medical Coding: A Next-Generation Agentic System for Scalable and Explainable Medical Coding

2603.29709v1 by Joakim Edin, Andreas Motzfeldt, Simon Flachs, Lars Maaløe

Medical coding translates free-text clinical documentation into standardized codes drawn from classification systems that contain tens of thousands of entries and are updated annually. It is central to billing, clinical research, and quality reporting, yet remains largely manual, slow, and error-prone. Existing automated approaches learn to predict a fixed set of codes from labeled data, thereby preventing adaptation to new codes or different coding systems without retraining on different data. They also provide no explanation for their predictions, limiting trust in safety-critical settings. We introduce Symphony for Medical Coding, a system that approaches the task the way expert human coders do: by reasoning over the clinical narrative with direct access to the coding guidelines. This design allows Symphony to operate across any coding system and to provide span-level evidence linking each predicted code to the text that supports it. We evaluate on two public benchmarks and three real-world datasets spanning inpatient, outpatient, emergency, and subspecialty settings across the United States and the United Kingdom. Symphony achieves state-of-the-art results across all settings, establishing itself as a flexible, deployment-ready foundation for automated clinical coding.

摘要:醫療編碼將自由文本的臨床文檔轉換為標準化代碼,這些代碼來自包含數萬條條目的分類系統,並每年更新一次。它對於計費、臨床研究和質量報告至關重要,但仍然主要依賴手動操作,速度慢且容易出錯。現有的自動化方法從標記數據中學習預測固定的代碼集,因此無法在不重新訓練不同數據的情況下適應新代碼或不同的編碼系統。它們也不提供對其預測的解釋,限制了在安全關鍵環境中的信任。我們介紹了醫療編碼的交響樂系統,這一系統以專業人類編碼員的方式處理任務:通過直接訪問編碼指南來推理臨床敘事。這一設計使得交響樂能夠在任何編碼系統中運作,並提供將每個預測代碼與支持其的文本聯繫起來的跨度級證據。我們在兩個公共基準和三個涵蓋美國和英國住院、門診、急診和專科環境的真實世界數據集上進行評估。交響樂在所有環境中都達到了最先進的結果,確立了其作為自動化臨床編碼的靈活、可部署的基礎。

Few-shot Writer Adaptation via Multimodal In-Context Learning

2603.29450v1 by Tom Simon, Stephane Nicolas, Pierrick Tranouez, Clement Chatelain, Thierry Paquet

While state-of-the-art Handwritten Text Recognition (HTR) models perform well on standard benchmarks, they frequently struggle with writers exhibiting highly specific styles that are underrepresented in the training data. To handle unseen and atypical writers, writer adaptation techniques personalize HTR models to individual handwriting styles. Leading writer adaptation methods require either offline fine-tuning or parameter updates at inference time, both involving gradient computation and backpropagation, which increase computational costs and demand careful hyperparameter tuning. In this work, we propose a novel context-driven HTR framework inspired by multimodal in-context learning, enabling inference-time writer adaptation using only a few examples from the target writer without any parameter updates. We further demonstrate the impact of context length, design a compact 8M-parameter CNN-Transformer that enables few-shot in-context adaptation, and show that combining context-driven and standard OCR training strategies leads to complementary improvements. Experiments on IAM and RIMES validate our approach with Character Error Rates of 3.92% and 2.34%, respectively, surpassing all writer-independent HTR models without requiring any parameter updates at inference time.

摘要:雖然最先進的手寫文本識別(HTR)模型在標準基準測試中表現良好,但它們經常在處理具有高度特定風格的寫作者時遇到困難,這些風格在訓練數據中代表性不足。為了處理未見過的和不典型的寫作者,寫作者適應技術使 HTR 模型個性化以符合個別的手寫風格。領先的寫作者適應方法需要離線微調或在推理時更新參數,這兩者都涉及梯度計算和反向傳播,從而增加計算成本並要求仔細的超參數調整。在這項工作中,我們提出了一種新穎的基於上下文的 HTR 框架,受到多模態上下文學習的啟發,使得在推理時僅使用目標寫作者的幾個示例即可進行寫作者適應,而無需任何參數更新。我們進一步展示了上下文長度的影響,設計了一個緊湊的 8M 參數 CNN-Transformer,使得少量示例的上下文適應成為可能,並顯示結合基於上下文和標準 OCR 訓練策略會帶來互補的改進。在 IAM 和 RIMES 上的實驗驗證了我們的方法,字符錯誤率分別為 3.92% 和 2.34%,超越了所有不依賴寫作者的 HTR 模型,且在推理時無需任何參數更新。

NeoNet: An End-to-End 3D MRI-Based Deep Learning Framework for Non-Invasive Prediction of Perineural Invasion via Generation-Driven Classification

2603.29449v1 by Youngung Han, Minkyung Cha, Kyeonghun Kim, Induk Um, Myeongbin Sho, Joo Young Bae, Jaewon Jung, Jung Hyeok Park, Seojun Lee, Nam-Joon Kim, Woo Kyoung Jeong, Won Jae Lee, Pa Hong, Ken Ying-Kai Liao, Hyuk-Jae Lee

Minimizing invasive diagnostic procedures to reduce the risk of patient injury and infection is a central goal in medical imaging. And yet, noninvasive diagnosis of perineural invasion (PNI), a critical prognostic factor involving infiltration of tumor cells along the surrounding nerve, remains challenging due to the lack of clear and consistent imaging criteria for identifying PNI. To address this challenge, we present NeoNet, an integrated end-to-end 3D deep learning framework for PNI prediction in cholangiocarcinoma that does not rely on predefined image features. NeoNet integrates three modules: (1) NeoSeg, utilizing a Tumor-Localized ROI Crop (TLCR) algorithm; (2) NeoGen, a 3D Latent Diffusion Model (LDM) with ControlNet, conditioned on anatomical masks to generate synthetic image patches, specifically balancing the dataset to a 1:1 ratio; and (3) NeoCls, the final prediction module. For NeoCls, we developed the PNI-Attention Network (PattenNet), which uses the frozen LDM encoder and specialized 3D Dual Attention Blocks (DAB) designed to detect subtle intensity variations and spatial patterns indicative of PNI. In 5-fold cross-validation, NeoNet outperformed baseline 3D models and achieved the highest performance with a maximum AUC of 0.7903.

摘要:最小化侵入性診斷程序以降低病人受傷和感染的風險是醫學影像中的一個核心目標。然而,周圍神經侵犯(PNI)的非侵入性診斷仍然具有挑戰性,這是一個關鍵的預後因素,涉及腫瘤細胞沿著周圍神經的浸潤,因為缺乏明確且一致的影像標準來識別PNI。為了解決這一挑戰,我們提出了NeoNet,一個集成的端到端3D深度學習框架,用於膽管癌中的PNI預測,該框架不依賴於預定義的影像特徵。NeoNet整合了三個模塊: (1) NeoSeg,利用腫瘤定位ROI裁剪(TLCR)算法; (2) NeoGen,一個帶有ControlNet的3D潛在擴散模型(LDM),基於解剖掩膜生成合成影像區塊,特別是將數據集平衡到1:1的比例;以及 (3) NeoCls,最終的預測模塊。對於NeoCls,我們開發了PNI-注意力網絡(PattenNet),該網絡使用凍結的LDM編碼器和專門設計的3D雙重注意力區塊(DAB),旨在檢測微妙的強度變化和空間模式,以指示PNI。在5折交叉驗證中,NeoNet的表現超過了基線3D模型,並以最高0.7903的AUC達到了最佳性能。

AI-Generated Prior Authorization Letters: Strong Clinical Content, Weak Administrative Scaffolding

2603.29366v1 by Moiz Sadiq Awan, Maryam Raza

Prior authorization remains one of the most burdensome administrative processes in U.S. healthcare, consuming billions of dollars and thousands of physician hours each year. While large language models have shown promise across clinical text tasks, their ability to produce submission-ready prior authorization letters has received only limited attention, with existing work confined to single-case demonstrations rather than structured multi-scenario evaluation. We assessed three commercially available LLMs (GPT-4o, Claude Sonnet 4.5, and Gemini 2.5 Pro) across 45 physician-validated synthetic scenarios spanning rheumatology, psychiatry, oncology, cardiology, and orthopedics. All three models generated letters with strong clinical content: accurate diagnoses, well-structured medical necessity arguments, and thorough step therapy documentation. However, a secondary analysis of real-world administrative requirements revealed consistent gaps that clinical scoring alone did not capture, including absent billing codes, missing authorization duration requests, and inadequate follow-up plans. These findings reframe the question: the challenge for clinical deployment is not whether LLMs can write clinically adequate letters, but whether the systems built around them can supply the administrative precision that payer workflows require.

摘要:事前授權仍然是美國醫療保健中最繁重的行政流程之一,每年耗費數十億美元和數千小時的醫生時間。雖然大型語言模型在臨床文本任務中顯示出潛力,但它們產生提交準備好的事前授權信的能力僅受到有限的關注,現有的工作僅限於單一案例的示範,而非結構化的多場景評估。我們評估了三種商業可用的LLM(GPT-4o、Claude Sonnet 4.5和Gemini 2.5 Pro),涵蓋了45個醫生驗證的合成場景,涉及風濕病學、精神病學、腫瘤學、心臟病學和骨科。這三個模型生成的信件具有強大的臨床內容:準確的診斷、結構良好的醫療必要性論證以及全面的步驟治療文檔。然而,對現實世界行政要求的次要分析揭示了一致的差距,這些差距僅僅依靠臨床評分無法捕捉,包括缺失的計費代碼、缺少的授權期限請求和不充分的後續計劃。這些發現重新框定了問題:臨床部署的挑戰不在於LLM是否能撰寫臨床上合格的信件,而在於圍繞它們構建的系統是否能提供支付者工作流程所需的行政精確性。

Predicting Neuromodulation Outcome for Parkinson's Disease with Generative Virtual Brain Model

2603.29176v1 by Siyuan Du, Siyi Li, Shuwei Bai, Ang Li, Haolin Li, Mingqing Xiao, Yang Pan, Dongsheng Li, Weidi Xie, Yanfeng Wang, Ya Zhang, Chencheng Zhang, Jiangchao Yao

Parkinson's disease (PD) affects over ten million people worldwide. Although temporal interference (TI) and deep brain stimulation (DBS) are promising therapies, inter-individual variability limits empirical treatment selection, increasing non-negligible surgical risk and cost. Previous explorations either resort to limited statistical biomarkers that are insufficient to characterize variability, or employ AI-driven methods that are prone to overfitting and opacity. We bridge this gap with a pretraining-finetuning framework to predict outcomes directly from resting-state fMRI. Critically, a generative virtual brain foundation model, pretrained on a collective dataset (2707 subjects, 5621 sessions) to capture universal disorder patterns, was finetuned on PD cohorts receiving TI (n=51) or DBS (n=55) to yield individualized virtual brains with high fidelity to empirical functional connectivity (r=0.935). By constructing counterfactual estimations between pathological and healthy neural states within these personalized models, we predicted clinical responses (TI: AUPR=0.853; DBS: AUPR=0.915), substantially outperforming baselines. External and prospective validations (n=14, n=11) highlight the feasibility of clinical translation. Moreover, our framework provides state-dependent regional patterns linked to response, offering hypothesis-generating mechanistic insights.

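Summary: Parkinson's disease (PD) affects more than ten million people worldwide. Although temporal interference (TI) and deep brain stimulation (DBS) are promising therapies, inter-individual variability limits empirical treatment selection and adds non-negligible surgical risk and cost. Earlier work either relies on limited statistical biomarkers that cannot adequately characterize this variability, or uses AI-driven methods that are prone to overfitting and opacity. We fill this gap with a pretraining-finetuning framework that predicts outcomes directly from resting-state fMRI. The key component is a generative virtual brain foundation model, pretrained on a pooled dataset (2707 subjects, 5621 sessions) to capture general disorder patterns and then finetuned on PD cohorts receiving TI (n=51) or DBS (n=55), yielding individualized virtual brains that closely match empirical functional connectivity (r=0.935). By constructing counterfactual estimates between pathological and healthy neural states within these personalized models, we predict clinical response (TI: AUPR=0.853; DBS: AUPR=0.915), substantially outperforming baselines. External and prospective validation (n=14, n=11) highlights the feasibility of clinical translation. In addition, the framework provides state-dependent regional patterns associated with response, offering hypothesis-generating mechanistic insight.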
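The fidelity figure quoted above (r=0.935) compares simulated and empirical functional connectivity. A minimal sketch of how such a score is commonly computed follows, assuming connectivity is the Pearson correlation between regional time series and fidelity is the correlation of the two matrices' upper triangles. The function names and toy signals are hypothetical and are not taken from the paper.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def functional_connectivity(signals):
    """Region-by-region correlation matrix from a list of regional time series."""
    n = len(signals)
    return [[pearson(signals[i], signals[j]) for j in range(n)] for i in range(n)]

def upper_triangle(matrix):
    """Off-diagonal upper-triangle entries, flattened row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def fc_fidelity(empirical_signals, simulated_signals):
    """Correlation between empirical and simulated connectivity patterns."""
    fc_emp = upper_triangle(functional_connectivity(empirical_signals))
    fc_sim = upper_triangle(functional_connectivity(simulated_signals))
    return pearson(fc_emp, fc_sim)

# Toy data: three regions, six time points (hypothetical values).
empirical = [
    [0.1, 0.4, 0.3, 0.8, 0.6, 0.9],
    [0.2, 0.5, 0.2, 0.7, 0.7, 1.0],
    [0.9, 0.3, 0.6, 0.2, 0.4, 0.1],
]
# Comparing a recording with itself gives the upper bound of the score.
print(fc_fidelity(empirical, empirical))
```

Only the upper triangle is compared because a correlation matrix is symmetric with a constant diagonal, so the remaining entries would add no information.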

Knowledge database development by large language models for countermeasures against viruses and marine toxins

2603.29149v1 by Hung N. Do, Jessica Z. Kubicek-Sutherland, S. Gnanakaran

Access to the most up-to-date information on medical countermeasures is important for the research and development of effective treatments for viruses and marine toxins. However, there is a lack of comprehensive databases that curate data on viruses and marine toxins, making decisions on medical countermeasures slow and difficult. In this work, we employ two large language models (LLMs), ChatGPT and Grok, to design two comprehensive databases of therapeutic countermeasures for five viruses (Lassa, Marburg, Ebola, Nipah, and Venezuelan equine encephalitis) as well as marine toxins. With high-level human-provided inputs, the two LLMs identify public databases containing data on the five viruses and marine toxins, collect relevant information from these databases and the literature, iteratively cross-validate the collected information, and design interactive webpages for easy access to the curated, comprehensive databases. Notably, ChatGPT is employed to design agentic AI workflows (consisting of two AI agents for research and decision-making) to rank countermeasures for viruses and marine toxins in the databases. Together, our work explores the potential of LLMs as a scalable, updatable approach for building comprehensive knowledge databases and supporting evidence-based decision-making.

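Summary: Access to up-to-date information on medical countermeasures is essential for researching and developing effective treatments for viruses and marine toxins. However, comprehensive databases that curate data on viruses and marine toxins are lacking, which makes decisions on medical countermeasures slow and difficult. In this work, we use two large language models (LLMs), ChatGPT and Grok, to design two comprehensive databases of therapeutic countermeasures for Lassa, Marburg, Ebola, Nipah, and Venezuelan equine encephalitis viruses, as well as marine toxins. Given high-level human input, the two LLMs identify public databases containing data on the five viruses and marine toxins, collect relevant information from those databases and the literature, iteratively cross-validate what they collect, and design interactive webpages for easy access to the curated databases. Notably, ChatGPT is used to design agentic AI workflows (two AI agents, one for research and one for decision-making) that rank the countermeasures in the databases. Overall, our work explores the potential of LLMs as a scalable, updatable way to build comprehensive knowledge databases and support evidence-based decision-making.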

Towards Explainable Stakeholder-Aware Requirements Prioritisation in Aged-Care Digital Health

2603.29114v1 by Yuqing Xiao, John Grundy, Anuradha Madugalla, Elizabeth Manias

Requirements engineering for aged-care digital health must account for human aspects, because requirement priorities are shaped not only by technical functionality but also by stakeholders' health conditions, socioeconomics, and lived experience. Knowing which human aspects matter most, and for whom, is critical for inclusive and evidence-based requirements prioritisation. Yet in practice, while some studies have examined human aspects in RE, they have largely relied on expert judgement or model-driven analysis rather than large-scale user studies with meaningful human-in-the-loop validation to determine which aspects matter most and why. To address this gap, we conducted a mixed-methods study with 103 older adults, 105 developers, and 41 caregivers. We first applied explainable machine learning to identify the human aspects most strongly associated with requirement priorities across 8 aged-care digital health themes, and then conducted 12 semi-structured interviews to validate and interpret the quantitative patterns. The results identify the key human aspects shaping requirement priorities, reveal their directional effects, and expose substantial misalignment across stakeholder groups. Together, these findings show that human-centric requirements analysis should engage stakeholder groups explicitly rather than collapsing their perspectives into a single aggregate view. This paper contributes an identification of the key human aspects driving requirement priorities in aged-care digital health and an explainable, human-centric RE framework that combines ML-derived importance rankings with qualitative validation to surface the stakeholder misalignments that inclusive requirements engineering must address.

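Summary: Requirements engineering (RE) for aged-care digital health must take human aspects into account, because requirement priorities are shaped not only by technical functionality but also by stakeholders' health conditions, socioeconomic circumstances, and lived experience. Knowing which human aspects matter most, and to whom, is critical for inclusive, evidence-based requirements prioritisation. In practice, studies that have examined human aspects in RE have largely relied on expert judgement or model-driven analysis rather than large-scale user studies with meaningful human-in-the-loop validation to establish which aspects matter most and why. To fill this gap, we conducted a mixed-methods study with 103 older adults, 105 developers, and 41 caregivers. We first applied explainable machine learning to identify the human aspects most strongly associated with requirement priorities across 8 aged-care digital health themes, then ran 12 semi-structured interviews to validate and interpret the quantitative patterns. The results identify the key human aspects shaping requirement priorities, reveal the direction of their effects, and expose substantial misalignment across stakeholder groups. These findings indicate that human-centric requirements analysis should engage stakeholder groups explicitly rather than merging their perspectives into a single aggregate view. The paper contributes an identification of the key human aspects driving requirement priorities in aged-care digital health, and an explainable, human-centric RE framework that combines ML-derived importance rankings with qualitative validation to surface the stakeholder misalignments that inclusive requirements engineering must address.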

A Latent Risk-Aware Machine Learning Approach for Predicting Operational Success in Clinical Trials based on TrialsBank

2603.29041v1 by Iness Halimi, Emmanuel Piffo, Oumnia Boudersa, Yvan Marcel Carre Vilmorin, Melissa Ait-ikhlef, Karima Kone, Andy Tan, Augustin Medina, Juliette Hernando, Sheila Ernest, Vatche Bartekian, Karine Lalonde, Mireille E Schnitzer, Gianolli Dorcelus

Clinical trials are characterized by high costs, extended timelines, and substantial operational risk, yet reliable prospective methods for predicting trial success before initiation remain limited. Existing artificial intelligence approaches often focus on isolated metrics or specific development stages and frequently rely on variables unavailable at the trial design phase, limiting real-world applicability. We present a hierarchical latent risk-aware machine learning framework for prospective prediction of clinical trial operational success using a curated subset of TrialsBank, a proprietary AI-ready database developed by Sorintellis, comprising 13,700 trials. Operational success was defined as the ability to initiate, conduct, and complete a clinical trial according to planned timelines, recruitment targets, and protocol specifications through database lock. This approach decomposes operational success prediction into two modeling stages. First, intermediate latent operational risk factors are predicted using more than 180 drug- and trial-level features available before trial initiation. These predicted latent risks are then integrated into a downstream model to estimate the probability of operational success. A staged data-splitting strategy was employed to prevent information leakage, and models were benchmarked using XGBoost, CatBoost, and Explainable Boosting Machines. Across Phase I-III, the framework achieves strong out-of-sample performance, with F1-scores of 0.93, 0.92, and 0.91, respectively. Incorporating latent risk drivers improves discrimination of operational failures, and performance remains robust under independent inference evaluation. These results demonstrate that clinical trial operational success can be prospectively forecasted using a latent risk-aware AI framework, enabling early risk assessment and supporting data-driven clinical development decision-making.

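Summary: Clinical trials are characterized by high costs, long timelines, and substantial operational risk, yet reliable prospective methods for predicting trial success before initiation remain limited. Existing AI approaches often focus on isolated metrics or specific development stages and frequently depend on variables that are unavailable at the trial design phase, which limits their real-world applicability. We present a hierarchical, latent risk-aware machine learning framework for predicting clinical trial operational success using a curated subset of TrialsBank, a proprietary AI-ready database developed by Sorintellis that comprises 13,700 trials. Operational success is defined as the ability to initiate, conduct, and complete a clinical trial according to planned timelines, recruitment targets, and protocol specifications through database lock. The approach splits the prediction into two modeling stages. First, intermediate latent operational risk factors are predicted from more than 180 drug- and trial-level features available before trial initiation. These predicted latent risks are then fed into a downstream model that estimates the probability of operational success. A staged data-splitting strategy is used to prevent information leakage, and models are benchmarked using XGBoost, CatBoost, and Explainable Boosting Machines. Across Phases I to III, the framework achieves strong out-of-sample performance, with F1-scores of 0.93, 0.92, and 0.91, respectively. Incorporating latent risk drivers improves discrimination of operational failures, and performance remains robust under independent inference evaluation. These results show that clinical trial operational success can be forecast prospectively with a latent risk-aware AI framework, enabling early risk assessment and supporting data-driven clinical development decisions.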

Human-Like Lifelong Memory: A Neuroscience-Grounded Architecture for Infinite Interaction

2603.29023v1 by Diego C. Lerma-Torres

Large language models lack persistent, structured memory for long-term interaction and context-sensitive retrieval. Expanding context windows does not solve this: recent evidence shows that context length alone degrades reasoning by up to 85% - even with perfect retrieval. We propose a bio-inspired memory framework grounded in complementary learning systems theory, cognitive behavioral therapy's belief hierarchy, dual-process cognition, and fuzzy-trace theory, organized around three principles: (1) Memory has valence, not just content - pre-computed emotional-associative summaries (valence vectors) organized in an emergent belief hierarchy inspired by Beck's cognitive model enable instant orientation before deliberation; (2) Retrieval defaults to System 1 with System 2 escalation - automatic spreading activation and passive priming as default, with deliberate retrieval only when needed, and graded epistemic states that address hallucination structurally; and (3) Encoding is active, present, and feedback-dependent - a thalamic gateway tags and routes information between stores, while the executive forms gists through curiosity-driven investigation, not passive exposure. Seven functional properties specify what any implementation must satisfy. Over time, the system converges toward System 1 processing - the computational analog of clinical expertise - producing interactions that become cheaper, not more expensive, with experience.

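Summary: Large language models lack persistent, structured memory for long-term interaction and context-sensitive retrieval. Expanding the context window does not solve this: recent evidence shows that context length alone degrades reasoning by up to 85%, even with perfect retrieval. We propose a bio-inspired memory framework grounded in complementary learning systems theory, the belief hierarchy of cognitive behavioral therapy, dual-process cognition, and fuzzy-trace theory, organized around three principles. (1) Memory has valence, not just content: pre-computed emotional-associative summaries (valence vectors), organized in an emergent belief hierarchy inspired by Beck's cognitive model, enable instant orientation before deliberation. (2) Retrieval defaults to System 1 with escalation to System 2: automatic spreading activation and passive priming are the default, deliberate retrieval happens only when needed, and graded epistemic states address hallucination structurally. (3) Encoding is active, present, and feedback-dependent: a thalamic gateway tags and routes information between stores, while the executive forms gists through curiosity-driven investigation rather than passive exposure. Seven functional properties specify what any implementation must satisfy. Over time the system converges toward System 1 processing, the computational analog of clinical expertise, so that interactions become cheaper rather than more expensive with experience.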

Towards a Medical AI Scientist

2603.28589v1 by Hongtao Wu, Boyun Zheng, Dingjie Song, Yu Jiang, Jianfeng Gao, Lei Xing, Lichao Sun, Yixuan Yuan

Autonomous systems that generate scientific hypotheses, conduct experiments, and draft manuscripts have recently emerged as a promising paradigm for accelerating discovery. However, existing AI Scientists remain largely domain-agnostic, limiting their applicability to clinical medicine, where research is required to be grounded in medical evidence with specialized data modalities. In this work, we introduce Medical AI Scientist, the first autonomous research framework tailored to clinical research. It enables clinically grounded ideation by transforming extensively surveyed literature into actionable evidence through a clinician-engineer co-reasoning mechanism, which improves the traceability of generated research ideas. It further facilitates evidence-grounded manuscript drafting guided by structured medical compositional conventions and ethical policies. The framework operates under 3 research modes, namely paper-based reproduction, literature-inspired innovation, and task-driven exploration, each corresponding to a distinct level of automated scientific inquiry with progressively increasing autonomy. Comprehensive evaluations by both large language models and human experts demonstrate that the ideas generated by the Medical AI Scientist are of substantially higher quality than those produced by commercial LLMs across 171 cases, 19 clinical tasks, and 6 data modalities. Meanwhile, our system achieves strong alignment between the proposed method and its implementation, while also demonstrating significantly higher success rates in executable experiments. Double-blind evaluations by human experts and the Stanford Agentic Reviewer suggest that the generated manuscripts approach MICCAI-level quality, while consistently surpassing those from ISBI and BIBM. The proposed Medical AI Scientist highlights the potential of leveraging AI for autonomous scientific discovery in healthcare.

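Summary: Autonomous systems that generate scientific hypotheses, run experiments, and draft manuscripts have recently emerged as a promising paradigm for accelerating discovery. However, existing AI Scientists remain largely domain-agnostic, which limits their use in clinical medicine, where research must be grounded in medical evidence and involves specialized data modalities. In this work we introduce Medical AI Scientist, the first autonomous research framework tailored to clinical research. Through a clinician-engineer co-reasoning mechanism, it turns extensively surveyed literature into actionable evidence, enabling clinically grounded ideation and improving the traceability of generated research ideas. It also supports evidence-grounded manuscript drafting guided by structured medical writing conventions and ethical policies. The framework operates in three research modes (paper-based reproduction, literature-inspired innovation, and task-driven exploration), each corresponding to a distinct level of automated scientific inquiry with progressively greater autonomy. Comprehensive evaluation by large language models and human experts shows that ideas generated by Medical AI Scientist are of substantially higher quality than those produced by commercial LLMs across 171 cases, 19 clinical tasks, and 6 data modalities. The system also achieves strong alignment between the proposed method and its implementation, and significantly higher success rates in executable experiments. Double-blind evaluation by human experts and the Stanford Agentic Reviewer suggests that the generated manuscripts approach MICCAI-level quality while consistently surpassing those from ISBI and BIBM. Medical AI Scientist highlights the potential of AI for autonomous scientific discovery in healthcare.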

Detecting low left ventricular ejection fraction from ECG using an interpretable and scalable predictor-driven framework

2603.28532v1 by Ya Zhou, Tianxiang Hao, Ziyi Cai, Haojie Zhu, Hejun He, Jia Liu, Xiaohan Fan, Jing Yuan

Low left ventricular ejection fraction (LEF) frequently remains undetected until progression to symptomatic heart failure, underscoring the need for scalable screening strategies. Although artificial intelligence-enabled electrocardiography (AI-ECG) has shown promise, existing approaches rely solely on end-to-end black-box models with limited interpretability or on tabular systems dependent on commercial ECG measurement algorithms with suboptimal performance. We introduced ECG-based Predictor-Driven LEF (ECGPD-LEF), a structured framework that integrates foundation model-derived diagnostic probabilities with interpretable modeling for detecting LEF from ECG. Trained on the benchmark EchoNext dataset comprising 72,475 ECG-echocardiogram pairs and evaluated in predefined independent internal (n=5,442) and external (n=16,017) cohorts, our framework achieved robust discrimination for moderate LEF (internal AUROC 88.4%, F1 64.5%; external AUROC 86.8%, F1 53.6%), consistently outperforming the official end-to-end baseline provided with the benchmark across demographic and clinical subgroups. Interpretability analyses identified high-impact predictors, including normal ECG, incomplete left bundle branch block, and subendocardial injury in anterolateral leads, driving LEF risk estimation. Notably, these predictors independently enabled zero-shot-like inference without task-specific retraining (internal AUROC 75.3-81.0%; external AUROC 71.6-78.6%), indicating that ventricular dysfunction is intrinsically encoded within structured diagnostic probability representations. This framework reconciles predictive performance with mechanistic transparency, supporting scalable enhancement through additional predictors and seamless integration with existing AI-ECG systems.

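Summary: Low left ventricular ejection fraction (LEF) often goes undetected until it progresses to symptomatic heart failure, underscoring the need for scalable screening strategies. Although AI-enabled electrocardiography (AI-ECG) has shown promise, existing approaches rely either on end-to-end black-box models with limited interpretability or on tabular systems that depend on commercial ECG measurement algorithms and perform suboptimally. We introduce ECG-based Predictor-Driven LEF (ECGPD-LEF), a structured framework that combines foundation model-derived diagnostic probabilities with interpretable modeling to detect LEF from ECG. Trained on the benchmark EchoNext dataset of 72,475 ECG-echocardiogram pairs and evaluated in predefined independent internal (n=5,442) and external (n=16,017) cohorts, the framework achieves robust discrimination for moderate LEF (internal AUROC 88.4%, F1 64.5%; external AUROC 86.8%, F1 53.6%), consistently outperforming the official end-to-end baseline supplied with the benchmark across demographic and clinical subgroups. Interpretability analyses identify high-impact predictors, including normal ECG, incomplete left bundle branch block, and subendocardial injury in the anterolateral leads, that drive LEF risk estimation. Notably, these predictors on their own enable zero-shot-like inference without task-specific retraining (internal AUROC 75.3-81.0%; external AUROC 71.6-78.6%), indicating that ventricular dysfunction is intrinsically encoded in structured diagnostic probability representations. The framework reconciles predictive performance with mechanistic transparency, and supports scalable enhancement through additional predictors and seamless integration with existing AI-ECG systems.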
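The zero-shot-like result above amounts to scoring LEF risk directly from a single diagnostic probability and measuring discrimination with AUROC. A minimal sketch of that evaluation follows, assuming (as an illustration only) that a lower probability of "normal ECG" signals higher risk. The probabilities and labels are made-up toy values, not data from the study.

```python
def auroc(scores, labels):
    """AUROC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical foundation-model outputs: P(normal ECG) per recording.
p_normal_ecg = [0.92, 0.85, 0.40, 0.15, 0.70, 0.10]
low_ef = [0, 0, 1, 1, 0, 1]  # 1 = low ejection fraction on echocardiogram

# Assumption: a lower probability of "normal ECG" is treated as higher LEF risk.
risk = [1.0 - p for p in p_normal_ecg]
print(auroc(risk, low_ef))
```

The pairwise form is quadratic in the number of samples, which is fine for a sketch; a rank-based formulation would be the usual choice for cohorts of the size reported above.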

FeDMRA: Federated Incremental Learning with Dynamic Memory Replay Allocation

2603.28455v1 by Tiantian Wang, Xiang Xiang, Simon S. Du

In federated healthcare systems, Federated Class-Incremental Learning (FCIL) has emerged as a key paradigm, enabling continuous adaptive model learning among distributed clients while safeguarding data privacy. However, in practical applications, data across agent nodes within the distributed framework often exhibits non-independent and identically distributed (non-IID) characteristics, rendering traditional continual learning methods inapplicable. To address these challenges, this paper covers more comprehensive incremental task scenarios and proposes a dynamic memory allocation strategy for exemplar storage based on the data replay mechanism. This strategy fully taps into the inherent potential of data heterogeneity, while taking into account the performance fairness of all participating clients, thereby establishing a balanced and adaptive solution to mitigate catastrophic forgetting. Unlike the fixed allocation of client exemplar memory, the proposed scheme emphasizes the rational allocation of limited storage resources among clients to improve model performance. Furthermore, extensive experiments are conducted on three medical image datasets, and the results demonstrate significant performance improvements compared to existing baseline models.

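Summary: In federated healthcare systems, Federated Class-Incremental Learning (FCIL) has emerged as a key paradigm that lets distributed clients keep adapting a model continuously while protecting data privacy. In practice, however, data across the nodes of a distributed framework is often non-independent and identically distributed (non-IID), which makes traditional continual learning methods inapplicable. To address these challenges, this paper covers a more comprehensive set of incremental task scenarios and proposes a dynamic memory allocation strategy for exemplar storage based on the data replay mechanism. The strategy exploits the inherent potential of data heterogeneity while accounting for performance fairness across all participating clients, giving a balanced and adaptive way to mitigate catastrophic forgetting. Unlike a fixed allocation of client exemplar memory, the proposed scheme emphasizes allocating limited storage rationally among clients to improve model performance. Extensive experiments on three medical image datasets show significant performance improvements over existing baseline models.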
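The abstract does not state the allocation rule, so the following is only an illustrative sketch of what a dynamic exemplar budget with a fairness floor could look like. It assumes each client has some non-negative priority score (for example a heterogeneity or forgetting measure); the function name `allocate_memory`, the scores, and the floor are all hypothetical and are not the paper's method.

```python
def allocate_memory(total_slots, scores, min_slots):
    """Split an exemplar budget across clients: a guaranteed floor plus a score-weighted share."""
    n = len(scores)
    if min_slots * n > total_slots:
        raise ValueError("floor exceeds total budget")
    remaining = total_slots - min_slots * n
    total_score = sum(scores)
    if total_score == 0:
        shares = [remaining / n] * n
    else:
        shares = [remaining * s / total_score for s in scores]
    alloc = [min_slots + int(share) for share in shares]
    # Hand out slots lost to rounding, largest fractional part first.
    leftover = total_slots - sum(alloc)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc

# Three clients with hypothetical priority scores 5, 3 and 2, and a floor of 10 slots each.
print(allocate_memory(100, [5, 3, 2], min_slots=10))
```

The floor is what distinguishes this from a purely proportional split: no client's replay buffer can shrink to zero, which is one simple way to express the fairness concern the abstract raises.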

The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

2603.28387v1 by Doan Nam Long Vu, Simone Balloccu

Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, FOR2107 (affective disorders) and OASIS-3 (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to 58% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely mentioning MRI availability in the task prompt accounts for 70-80% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the scaffold effect. Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.

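Summary: Trustworthy clinical AI requires that performance gains reflect genuine integration of evidence rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification in two clinical neuroimaging cohorts, FOR2107 (affective disorders) and OASIS-3 (cognitive decline). Both datasets include structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs gain up to 58% F1 once neuroimaging context is introduced, and distilled models become competitive with counterparts an order of magnitude larger. A contrastive confidence analysis shows that merely mentioning the availability of MRI in the task prompt accounts for 70-80% of this shift, regardless of whether imaging data is actually present. We call this domain-specific instance of modality collapse the scaffold effect. Expert evaluation finds fabricated neuroimaging-based justifications in all conditions, and preference alignment, while eliminating MRI-referencing behavior, pushes both conditions toward the random baseline. Our findings show that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for deploying VLMs in clinical settings.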
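To make the "70-80% of this shift" figure concrete, here is a small worked sketch of one way such a share could be computed. It assumes three prompt conditions (a text-only baseline, a prompt that mentions MRI without supplying an image, and a prompt with the image attached) and a mean confidence per condition. The decomposition, the function name `mention_share`, and the numbers are illustrative assumptions and not the paper's definition.

```python
def mention_share(conf_base, conf_mention_only, conf_with_image):
    """Fraction of the confidence shift reproduced by mentioning the modality alone."""
    total_shift = conf_with_image - conf_base
    if total_shift == 0:
        raise ValueError("no shift between baseline and full condition")
    return (conf_mention_only - conf_base) / total_shift

# Hypothetical mean confidences under the three prompt conditions.
share = mention_share(conf_base=0.50, conf_mention_only=0.65, conf_with_image=0.70)
print(round(share, 2))
```

A share near 1 would mean the image itself contributes almost nothing beyond the prompt wording, which is the pattern the abstract describes.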

Bit-Identical Medical Deep Learning via Structured Orthogonal Initialization

2603.28040v1 by Yakov Pyotr Shkolnikov

Deep learning training is non-deterministic: identical code with different random seeds produces models that agree on aggregate metrics but disagree on individual predictions, with per-class AUC swings exceeding 20 percentage points on rare clinical classes. We present a framework for verified bit-identical training that eliminates three sources of randomness: weight initialization (via structured orthogonal basis functions), batch ordering (via golden ratio scheduling), and non-deterministic GPU operations (via architecture selection and custom autograd). The pipeline produces MD5-verified identical trained weights across independent runs. On PTB-XL ECG rhythm classification, structured initialization significantly exceeds Kaiming across two architectures (n=20; Conformer p = 0.016, Baseline p < 0.001), reducing aggregate variance by 2-3x and reducing per-class variability on rare rhythms by up to 7.5x (TRIGU range: 4.1pp vs 30.9pp under Kaiming, independently confirmed by 3-fold CV). A four-basis comparison at n=20 shows all structured orthogonal bases produce equivalent performance (Friedman p=0.48), establishing that the contribution is deterministic structured initialization itself, not any particular basis function. Cross-domain validation on seven MedMNIST benchmarks (n=20, all p > 0.14) confirms no performance penalty on standard tasks; per-class analysis on imbalanced tasks (ChestMNIST, RetinaMNIST) shows the same variance reduction on rare classes observed in ECG. Cross-dataset evaluation on three external ECG databases confirms zero-shot generalization (>0.93 AFIB AUC).

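Summary: Deep learning training is non-deterministic: identical code run with different random seeds yields models that agree on aggregate metrics but disagree on individual predictions, with per-class AUC swings of more than 20 percentage points on rare clinical classes. We present a framework for verified bit-identical training that removes three sources of randomness: weight initialization (via structured orthogonal basis functions), batch ordering (via golden ratio scheduling), and non-deterministic GPU operations (via architecture selection and custom autograd). The pipeline produces MD5-verified identical trained weights across independent runs. On PTB-XL ECG rhythm classification, structured initialization significantly exceeds Kaiming initialization across two architectures (n=20; Conformer p = 0.016, Baseline p < 0.001), reducing aggregate variance by 2-3x and per-class variability on rare rhythms by up to 7.5x (TRIGU range: 4.1pp versus 30.9pp under Kaiming, independently confirmed by 3-fold CV). A four-basis comparison at n=20 shows that all structured orthogonal bases perform equivalently (Friedman p=0.48), establishing that the contribution comes from deterministic structured initialization itself rather than any particular basis function. Cross-domain validation on seven MedMNIST benchmarks (n=20, all p > 0.14) confirms no performance penalty on standard tasks; per-class analysis on imbalanced tasks (ChestMNIST, RetinaMNIST) shows the same variance reduction on rare classes as observed in ECG. Cross-dataset evaluation on three external ECG databases confirms zero-shot generalization (>0.93 AFIB AUC).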
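Two of the ingredients named above, a seed-free batch order and an MD5 check on the resulting weights, can be illustrated in a few lines. The sketch below assumes that "golden ratio scheduling" means ranking sample indices by the fractional part of i times the golden ratio, which is a common low-discrepancy construction but may differ from the paper's scheme. The toy "training" loop is a stand-in and does not reflect the paper's models.

```python
import hashlib
import math
import struct

GOLDEN_FRACTION = (math.sqrt(5.0) - 1.0) / 2.0  # fractional part of the golden ratio

def golden_ratio_order(n):
    """Seed-free permutation of range(n), ranked by the sequence frac((i + 1) * phi)."""
    keys = [((i + 1) * GOLDEN_FRACTION) % 1.0 for i in range(n)]
    return sorted(range(n), key=lambda i: (keys[i], i))

def weights_md5(weights):
    """MD5 digest of weights packed as little-endian float64 values."""
    payload = struct.pack("<%dd" % len(weights), *weights)
    return hashlib.md5(payload).hexdigest()

def toy_training_run(n_samples=8):
    """Stand-in for training: one weight update per sample, in deterministic order."""
    weights = [0.0, 0.0]
    for step, idx in enumerate(golden_ratio_order(n_samples)):
        weights[step % 2] += 0.01 * (idx + 1)
    return weights

run_a = weights_md5(toy_training_run())
run_b = weights_md5(toy_training_run())
print(run_a == run_b)  # True: both runs follow the same order and arithmetic
```

Hashing the packed bytes rather than comparing rounded values is what makes the check bit-level: any difference in the last bit of any weight changes the digest.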

Towards Emotion Recognition with 3D Pointclouds Obtained from Facial Expression Images

2603.27798v1 by Laura Rayón Ropero, Jasper De Laet, Filip Lemic, Pau Sabater Nácher, Nabeel Nisar Bhat, Sergi Abadal, Jeroen Famaey, Eduard Alarcón, Xavier Costa-Pérez

Facial Emotion Recognition is a critical research area within Affective Computing due to its wide-ranging applications in Human Computer Interaction, mental health assessment and fatigue monitoring. Current FER methods predominantly rely on Deep Learning techniques trained on 2D image data, which pose significant privacy concerns and are unsuitable for continuous, real-time monitoring. As an alternative, we propose High-Frequency Wireless Sensing (HFWS) as an enabler of continuous, privacy-aware FER, through the generation of detailed 3D facial pointclouds via on-person sensors embedded in wearables. We present arguments supporting the privacy advantages of HFWS over traditional 2D imaging, particularly under increasingly stringent data protection regulations. A major barrier to adopting HFWS for FER is the scarcity of labeled 3D FER datasets. Towards addressing this issue, we introduce a FLAME-based method to generate 3D facial pointclouds from existing public 2D datasets. Using this approach, we create AffectNet3D, a 3D version of the AffectNet database. To evaluate the quality and usability of the generated data, we design a pointcloud refinement pipeline focused on isolating the facial region, and train the popular PointNet++ model on the refined pointclouds. Fine-tuning the model on a small subset of the unseen 3D FER dataset BU-3DFE yields a classification accuracy exceeding 70%, comparable to oracle-level performance. To further investigate the potential of HFWS-based FER for continuous monitoring, we simulate wearable sensing conditions by masking portions of the generated pointclouds. Experimental results show that models trained on AffectNet3D and fine-tuned with just 25% of BU-3DFE outperform those trained solely on BU-3DFE. These findings highlight the viability of our pipeline and support the feasibility of continuous, privacy-aware FER via wearable HFWS systems.

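Summary: Facial Emotion Recognition (FER) is a critical research area within Affective Computing because of its wide-ranging applications in human-computer interaction, mental health assessment, and fatigue monitoring. Current FER methods rely mainly on deep learning techniques trained on 2D image data, which raise significant privacy concerns and are unsuitable for continuous, real-time monitoring. As an alternative, we propose High-Frequency Wireless Sensing (HFWS) as an enabler of continuous, privacy-aware FER, generating detailed 3D facial pointclouds through on-person sensors embedded in wearables. We present arguments for the privacy advantages of HFWS over traditional 2D imaging, particularly under increasingly strict data protection regulations. A major barrier to adopting HFWS for FER is the scarcity of labeled 3D FER datasets. To address this, we introduce a FLAME-based method that generates 3D facial pointclouds from existing public 2D datasets, and use it to create AffectNet3D, a 3D version of the AffectNet database. To evaluate the quality and usability of the generated data, we design a pointcloud refinement pipeline focused on isolating the facial region, and train the popular PointNet++ model on the refined pointclouds. Fine-tuning the model on a small subset of the unseen 3D FER dataset BU-3DFE yields a classification accuracy above 70%, comparable to oracle-level performance. To further investigate the potential of HFWS-based FER for continuous monitoring, we simulate wearable sensing conditions by masking portions of the generated pointclouds. Experiments show that models trained on AffectNet3D and fine-tuned with just 25% of BU-3DFE outperform those trained solely on BU-3DFE. These findings highlight the viability of our pipeline and support the feasibility of continuous, privacy-aware FER via wearable HFWS systems.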
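The masking step used to simulate wearable sensing can be pictured as keeping only the points that fall inside a sensor's field of view. The sketch below assumes an axis-aligned box as the visible region, which is a simplification chosen for illustration; the function name `mask_pointcloud` and the toy grid are hypothetical and are not the paper's pipeline.

```python
def mask_pointcloud(points, keep_region):
    """Keep only points inside an axis-aligned box ((xmin, xmax), (ymin, ymax), (zmin, zmax))."""
    (xmin, xmax), (ymin, ymax), (zmin, zmax) = keep_region
    return [
        (x, y, z)
        for x, y, z in points
        if xmin <= x <= xmax and ymin <= y <= ymax and zmin <= z <= zmax
    ]

# Toy "face": a flat 5 x 5 grid of points spanning [-1, 1] in x and y (hypothetical data).
face = [(x / 2.0, y / 2.0, 0.0) for x in range(-2, 3) for y in range(-2, 3)]

# Keep only the lower half (y <= 0), as if the sensor saw just the mouth and jaw.
lower_half = ((-1.0, 1.0), (-1.0, 0.0), (-1.0, 1.0))
visible = mask_pointcloud(face, lower_half)
print(len(face), len(visible))
```

Training on such partial clouds is one way to check whether a classifier still works when a wearable covers only part of the face.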

RAP: Retrieve, Adapt, and Prompt-Fit for Training-Free Few-Shot Medical Image Segmentation

2603.27705v1 by Zhihao Mao, Bangpu Chen

Few-shot medical image segmentation (FSMIS) has achieved notable progress, yet most existing methods mainly rely on semantic correspondences from scarce annotations while under-utilizing a key property of medical imagery: anatomical targets exhibit repeatable high-frequency morphology (e.g., boundary geometry and spatial layout) across patients and acquisitions. We propose RAP, a training-free framework that retrieves, adapts, and prompts Segment Anything Model 2 (SAM2) for FSMIS. First, RAP retrieves morphologically compatible supports from an archive using DINOv3 features to reduce brittleness in single-support choice. Second, it adapts the retrieved support mask to the query by fitting boundary-aware structural cues, yielding an anatomy-consistent pre-mask under domain shifts. Third, RAP converts the pre-mask into prompts by sampling positive points via Voronoi partitioning and negative points via sector-based sampling, and feeds them into SAM2 for final refinement without any fine-tuning. Extensive experiments on multiple medical segmentation benchmarks show that RAP consistently surpasses prior FSMIS baselines and achieves state-of-the-art performance. Overall, RAP demonstrates that explicit structural fitting combined with retrieval-augmented prompting offers a simple and effective route to robust training-free few-shot medical segmentation.

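Summary: Few-shot medical image segmentation (FSMIS) has made notable progress, yet most existing methods rely mainly on semantic correspondences from scarce annotations while under-using a key property of medical imagery: anatomical targets show repeatable high-frequency morphology (for example, boundary geometry and spatial layout) across patients and acquisitions. We propose RAP, a training-free framework that retrieves, adapts, and prompts Segment Anything Model 2 (SAM2) for FSMIS. First, RAP retrieves morphologically compatible supports from an archive using DINOv3 features, reducing the brittleness of a single-support choice. Second, it adapts the retrieved support mask to the query by fitting boundary-aware structural cues, producing an anatomy-consistent pre-mask under domain shift. Third, RAP converts the pre-mask into prompts by sampling positive points via Voronoi partitioning and negative points via sector-based sampling, and feeds them to SAM2 for final refinement without any fine-tuning. Extensive experiments on multiple medical segmentation benchmarks show that RAP consistently surpasses prior FSMIS baselines and achieves state-of-the-art performance. Overall, RAP shows that explicit structural fitting combined with retrieval-augmented prompting offers a simple and effective route to robust training-free few-shot medical segmentation.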
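The first of the three steps above, retrieving compatible supports by feature similarity, can be sketched as a top-k nearest-neighbour lookup. The example assumes cosine similarity over precomputed feature vectors; the abstract names DINOv3 as the feature extractor but does not specify the similarity measure, so that choice, the archive layout, and the three-dimensional toy features are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve_supports(query_feat, archive, k=2):
    """Return the k archive entries whose features are most similar to the query."""
    ranked = sorted(archive, key=lambda item: cosine(query_feat, item["feat"]), reverse=True)
    return ranked[:k]

# Hypothetical archive of support cases with precomputed image features.
archive = [
    {"id": "case_a", "feat": [0.9, 0.1, 0.0]},
    {"id": "case_b", "feat": [0.1, 0.9, 0.2]},
    {"id": "case_c", "feat": [0.8, 0.3, 0.1]},
]
query = [1.0, 0.2, 0.0]
print([item["id"] for item in retrieve_supports(query, archive)])
```

Returning several supports rather than one is what addresses the brittleness the abstract mentions: a single poorly matched support no longer determines the pre-mask.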

Project Imaging-X: A Survey of 1000+ Open-Access Medical Imaging Datasets for Foundation Model Development

2603.27460v1 by Zhongying Deng, Cheng Tang, Ziyan Huang, Jiashi Lin, Ying Chen, Junzhi Ning, Chenglong Ma, Jiyao Liu, Wei Li, Yinghao Zhu, Shujian Gao, Yanyan Huang, Sibo Ju, Yanzhou Su, Pengcheng Chen, Wenhao Tang, Tianbin Li, Haoyu Wang, Yuanfeng Ji, Hui Sun, Shaobo Min, Liang Peng, Feilong Tang, Haochen Xue, Rulin Zhou, Chaoyang Zhang, Wenjie Li, Shaohao Rui, Weijie Ma, Xingyue Zhao, Yibin Wang, Kun Yuan, Zhaohui Lu, Shujun Wang, Jinjie Wei, Lihao Liu, Dingkang Yang, Lin Wang, Yulong Li, Haolin Yang, Yiqing Shen, Lequan Yu, Xiaowei Hu, Yun Gu, Yicheng Wu, Benyou Wang, Minghui Zhang, Angelica I. Aviles-Rivero, Qi Gao, Hongming Shan, Xiaoyu Ren, Fang Yan, Hongyu Zhou, Haodong Duan, Maosong Cao, Shanshan Wang, Bin Fu, Xiaomeng Li, Zhi Hou, Chunfeng Song, Lei Bai, Yuan Cheng, Yuandong Pu, Xiang Li, Wenhai Wang, Hao Chen, Jiaxin Zhuang, Songyang Zhang, Huiguang He, Mengzhang Li, Bohan Zhuang, Zhian Bai, Rongshan Yu, Liansheng Wang, Yukun Zhou, Xiaosong Wang, Xin Guo, Guanbin Li, Xiangru Lin, Dakai Jin, Mianxin Liu, Wenlong Zhang, Qi Qin, Conghui He, Yuqiang Li, Ye Luo, Nanqing Dong, Jie Xu, Wenqi Shao, Bo Zhang, Qiujuan Yan, Yihao Liu, Jun Ma, Zhi Lu, Yuewen Cao, Zongwei Zhou, Jianming Liang, Shixiang Tang, Qi Duan, Dongzhan Zhou, Chen Jiang, Yuyin Zhou, Yanwu Xu, Jiancheng Yang, Shaoting Zhang, Xiaohong Liu, Siqi Luo, Yi Xin, Chaoyu Liu, Haochen Wen, Xin Chen, Alejandro Lozano, Min Woo Sun, Yuhui Zhang, Yue Yao, Xiaoxiao Sun, Serena Yeung-Levy, Xia Li, Jing Ke, Chunhui Zhang, Zongyuan Ge, Ming Hu, Jin Ye, Zhifeng Li, Yirong Chen, Yu Qiao, Junjun He

Foundation models have demonstrated remarkable success across diverse domains and tasks, primarily owing to the growth of large-scale, diverse, and high-quality datasets. However, in the field of medical imaging, the curation and assembly of such datasets are highly challenging due to the reliance on clinical expertise and strict ethical and privacy constraints, resulting in a scarcity of large-scale unified medical datasets and hindering the development of powerful medical foundation models. In this work, we present the largest survey to date of medical image datasets, covering over 1,000 open-access datasets with a systematic catalog of their modalities, tasks, anatomies, annotations, limitations, and potential for integration. Our analysis exposes a landscape that is modest in scale, fragmented across narrowly scoped tasks, and unevenly distributed across organs and modalities, which in turn limits the utility of existing medical image datasets for developing versatile and robust medical foundation models. To turn fragmentation into scale, we propose a metadata-driven fusion paradigm (MDFP) that integrates public datasets with shared modalities or tasks, thereby transforming multiple small data silos into larger, more coherent resources. Building on MDFP, we release an interactive discovery portal that enables end-to-end, automated medical image dataset integration, and compile all surveyed datasets into a unified, structured table that clearly summarizes their key characteristics and provides reference links, offering the community an accessible and comprehensive repository. By charting the current terrain and offering a principled path to dataset consolidation, our survey provides a practical roadmap for scaling medical imaging corpora, supporting faster data discovery, more principled dataset creation, and more capable medical foundation models.

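Summary: Foundation models have shown remarkable success across diverse domains and tasks, largely thanks to the growth of large-scale, diverse, high-quality datasets. In medical imaging, however, curating and assembling such datasets is highly challenging because it depends on clinical expertise and is subject to strict ethical and privacy constraints. The result is a scarcity of large-scale unified medical datasets, which hinders the development of powerful medical foundation models. In this work we present the largest survey of medical image datasets to date, covering more than 1,000 open-access datasets and systematically cataloguing their modalities, tasks, anatomies, annotations, limitations, and potential for integration. Our analysis reveals a landscape that is modest in scale, fragmented across narrowly scoped tasks, and unevenly distributed across organs and modalities, which in turn limits how useful existing medical image datasets are for developing versatile, robust medical foundation models. To turn fragmentation into scale, we propose a metadata-driven fusion paradigm (MDFP) that integrates public datasets sharing modalities or tasks, turning many small data silos into larger, more coherent resources. Building on MDFP, we release an interactive discovery portal that enables end-to-end, automated medical image dataset integration, and compile all surveyed datasets into a unified, structured table that summarizes their key characteristics and provides reference links, giving the community an accessible and comprehensive repository. By charting the current terrain and offering a principled path to dataset consolidation, the survey provides a practical roadmap for scaling medical imaging corpora, supporting faster data discovery, more principled dataset creation, and more capable medical foundation models.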

Improving Automated Wound Assessment Using Joint Boundary Segmentation and Multi-Class Classification Models

2603.27325v1 by Mehedi Hasan Tusar, Fateme Fayyazbakhsh, Igor Melnychuk, Ming C. Leu

Accurate wound classification and boundary segmentation are essential for guiding clinical decisions in both chronic and acute wound management. However, most existing AI models are limited, focusing on a narrow set of wound types or performing only a single task (segmentation or classification), which reduces their clinical applicability. This study presents a deep learning model based on YOLOv11 that simultaneously performs wound boundary segmentation (WBS) and wound classification (WC) across five clinically relevant wound types: burn injury (BI), pressure injury (PI), diabetic foot ulcer (DFU), vascular ulcer (VU), and surgical wound (SW). A wound-type balanced dataset of 2,963 annotated images was created to train the models for both tasks, with stratified five-fold cross-validation ensuring robust and unbiased evaluation. The models trained on the original non-augmented dataset achieved consistent performance across folds, though BI detection accuracy was relatively lower. Therefore, the dataset was augmented using rotation, flipping, and variations in brightness, saturation, and exposure to help the model learn more generalized and invariant features. This augmentation significantly improved model performance, particularly in detecting visually subtle BI cases. Among tested variants, YOLOv11x achieved the highest performance with F1-scores of 0.9341 (WBS) and 0.8736 (WC), while the lightweight YOLOv11n provided comparable accuracy at lower computational cost, making it suitable for resource-constrained deployments. Supported by confusion matrices and visual detection outputs, the results confirm the model's robustness against complex backgrounds and high intra-class variability, demonstrating the potential of YOLOv11-based architectures for accurate, real-time wound analysis in both clinical and remote care settings.

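Summary: Accurate wound classification and boundary segmentation are essential for guiding clinical decisions in both chronic and acute wound management. However, most existing AI models are limited: they focus on a narrow set of wound types or perform only a single task (segmentation or classification), which reduces their clinical applicability. This study presents a deep learning model based on YOLOv11 that simultaneously performs wound boundary segmentation (WBS) and wound classification (WC) across five clinically relevant wound types: burn injury (BI), pressure injury (PI), diabetic foot ulcer (DFU), vascular ulcer (VU), and surgical wound (SW). A wound-type balanced dataset of 2,963 annotated images was created to train the models for both tasks, with stratified five-fold cross-validation ensuring robust, unbiased evaluation. Models trained on the original, non-augmented dataset performed consistently across folds, although BI detection accuracy was relatively lower. The dataset was therefore augmented with rotation, flipping, and variations in brightness, saturation, and exposure to help the model learn more general, invariant features. Augmentation significantly improved performance, particularly on visually subtle BI cases. Among the tested variants, YOLOv11x achieved the highest performance, with F1-scores of 0.9341 (WBS) and 0.8736 (WC), while the lightweight YOLOv11n delivered comparable accuracy at lower computational cost, making it suitable for resource-constrained deployments. Supported by confusion matrices and visual detection outputs, the results confirm the model's robustness to complex backgrounds and high intra-class variability, demonstrating the potential of YOLOv11-based architectures for accurate, real-time wound analysis in both clinical and remote care settings.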
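Stratified five-fold cross-validation and the F1-score are the two evaluation tools named above. A minimal sketch of both follows, assuming round-robin assignment within each class (real pipelines usually shuffle first) and F1 computed from raw true-positive, false-positive and false-negative counts. The toy labels and counts are invented for illustration and do not come from the study.

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Assign each sample index to one of k folds, round-robin within each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        for position, idx in enumerate(by_class[label]):
            folds[position % k].append(idx)
    return folds

def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall, from raw counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

labels = ["BI", "PI", "DFU", "VU", "SW"] * 10  # 50 toy samples, 10 per wound type
folds = stratified_folds(labels, k=5)
print([len(fold) for fold in folds])
print(round(f1_score(tp=8, fp=2, fn=2), 4))
```

Stratifying keeps every wound type represented in every fold, which matters when one class (here BI) is harder to detect and per-class scores are being compared across folds.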

Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection

2603.27240v1 by Jinhu Fu, Yihang Lou, Qingyi Si, Shudong Zhang, Yan Bai, Sen Su

Large Vision-Language Models (LVLMs) have achieved impressive performance across multimodal understanding and reasoning tasks, yet their internal safety mechanisms remain opaque and poorly controlled. In this work, we present a comprehensive framework for diagnosing and repairing unsafe channels within LVLMs (CARE). We first perform causal mediation analysis to identify neurons and layers that are causally responsible for unsafe behaviors. Based on these findings, we introduce a dual-modal safety subspace projection method that learns generalized safety subspaces for both visual and textual modalities through generalized eigen-decomposition between benign and malicious activations. During inference, activations are dynamically projected toward these safety subspaces via a hybrid fusion mechanism that adaptively balances visual and textual corrections, effectively suppressing unsafe features while preserving semantic fidelity. Extensive experiments on multiple safety benchmarks demonstrate that our causal-subspace repair framework significantly enhances safety robustness without degrading general multimodal capabilities, outperforming prior activation steering and alignment-based baselines. Additionally, our method exhibits good transferability, defending against unseen attacks.

摘要:大型視覺-語言模型(LVLMs)在多模態理解和推理任務中取得了令人印象深刻的表現,但其內部安全機制仍然不透明且控制不佳。在這項工作中,我們提出了一個全面的框架,用於診斷和修復LVLM中的不安全通道(CARE)。我們首先進行因果中介分析,以識別對不安全行為負有因果責任的神經元和層。基於這些發現,我們引入了一種雙模態安全子空間投影方法,通過良性和惡性激活之間的廣義特徵分解來學習視覺和文本模態的廣義安全子空間。在推理過程中,激活通過一種混合融合機制動態投影到這些安全子空間,該機制自適應地平衡視覺和文本的修正,有效抑制不安全特徵,同時保持語義的真實性。在多個安全基準上的廣泛實驗表明,我們的因果子空間修復框架顯著增強了安全穩健性,而不會降低一般的多模態能力,超越了先前的激活引導和對齊基準。此外,我們的方法展現了良好的可轉移性,能夠抵禦未見攻擊。

MediHive: A Decentralized Agent Collective for Medical Reasoning

2603.27150v1 by Xiaoyang Wang, Christopher C. Yang

Large language models (LLMs) have revolutionized medical reasoning tasks, yet single-agent systems often falter on complex, interdisciplinary problems requiring robust handling of uncertainty and conflicting evidence. Multi-agent systems (MAS) leveraging LLMs enable collaborative intelligence, but prevailing centralized architectures suffer from scalability bottlenecks, single points of failure, and role confusion in resource-constrained environments. Decentralized MAS (D-MAS) promise enhanced autonomy and resilience via peer-to-peer interactions, but their application to high-stakes healthcare domains remains underexplored. We introduce MediHive, a novel decentralized multi-agent framework for medical question answering that integrates a shared memory pool with iterative fusion mechanisms. MediHive deploys LLM-based agents that autonomously self-assign specialized roles, conduct initial analyses, detect divergences through conditional evidence-based debates, and locally fuse peer insights over multiple rounds to achieve consensus. Empirically, MediHive outperforms single-LLM and centralized baselines on MedQA and PubMedQA datasets, attaining accuracies of 84.3% and 78.4%, respectively. Our work advances scalable, fault-tolerant D-MAS for medical AI, addressing key limitations of centralized designs while demonstrating superior performance in reasoning-intensive tasks.

摘要:大型語言模型(LLMs)已經徹底改變了醫療推理任務,但單一代理系統在處理需要強大不確定性和衝突證據的複雜跨學科問題時,往往表現不佳。利用LLMs的多代理系統(MAS)能夠實現協作智能,但現有的集中式架構在資源有限的環境中面臨可擴展性瓶頸、單點故障和角色混淆的問題。去中心化的多代理系統(D-MAS)通過點對點互動承諾增強自主性和韌性,但其在高風險醫療領域的應用仍然未得到充分探索。我們介紹了MediHive,一個新穎的去中心化多代理框架,用於醫療問題回答,該框架整合了共享記憶池和迭代融合機制。MediHive 部署了基於LLM的代理,這些代理能夠自主自我分配專業角色,進行初步分析,通過條件證據辯論檢測分歧,並在多輪中本地融合同伴見解以達成共識。實證結果表明,MediHive在MedQA和PubMedQA數據集上的表現優於單一LLM和集中基準,分別達到84.3%和78.4%的準確率。我們的工作推進了可擴展、容錯的D-MAS在醫療AI中的應用,解決了集中設計的關鍵限制,同時在推理密集型任務中展示了卓越的性能。

Bayes-MICE: A Bayesian Approach to Multiple Imputation for Time Series Data

2603.27142v1 by Amuche Ibenegbu, Pierre Lafaye de Micheaux, Rohitash Chandra

Time-series analysis is often affected by missing data, a common problem across several fields, including healthcare and environmental monitoring. Multiple Imputation by Chained Equations (MICE) has been prominent for imputing missing values through "fully conditional specification". We extend MICE using the Bayesian framework (Bayes-MICE), utilising Bayesian inference to impute missing values via Markov Chain Monte Carlo (MCMC) sampling to account for uncertainty in MICE model parameters and imputed values. We also include temporally informed initialisation and time-lagged features in the model to respect the sequential nature of time-series data. We evaluate the Bayes-MICE method using two real-world datasets (AirQuality and PhysioNet), and using both the Random Walk Metropolis (RWM) and the Metropolis-Adjusted Langevin Algorithm (MALA) samplers. Our results demonstrate that Bayes-MICE reduces imputation errors relative to the baseline methods over all variables and accounts for uncertainty in the imputation process, thereby providing a more accurate measure of imputation error. We also found that MALA converges faster than RWM, achieving comparable accuracy while providing more consistent posterior exploration. Overall, these findings suggest that the Bayes-MICE framework represents a practical and efficient approach to time-series imputation, balancing increased accuracy with meaningful quantification of uncertainty in various environmental and clinical settings.

摘要:時間序列分析常常受到缺失資料的影響,這是多個領域中的常見問題,包括醫療保健和環境監測。多重插補鏈式方程(MICE)在通過「完全條件規範」插補缺失值方面非常重要。我們使用貝葉斯框架擴展MICE(Bayes-MICE),利用貝葉斯推斷通過馬爾可夫鏈蒙特卡羅(MCMC)抽樣來插補缺失值,以考慮MICE模型參數和插補值的不確定性。我們還在模型中包含了時間信息初始化和時間滯後特徵,以尊重時間序列數據的序列性質。我們使用兩個真實世界數據集(AirQuality和PhysioNet)來評估Bayes-MICE方法,並使用隨機漫步梅特羅波利斯(RWM)和梅特羅波利斯調整的朗之萬算法(MALA)抽樣器。我們的結果表明,Bayes-MICE在所有變數上相對於基準方法減少了插補誤差,並考慮了插補過程中的不確定性,從而提供了更準確的插補誤差測量。我們還發現MALA的收斂速度比RWM更快,實現了可比的準確性,同時提供了更一致的後驗探索。總體而言,這些發現表明,Bayes-MICE框架代表了一種實用且高效的時間序列插補方法,在各種環境和臨床設置中平衡了提高的準確性與不確定性的有意義量化。
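
The Random Walk Metropolis step named in the Bayes-MICE abstract can be illustrated with a deliberately small sketch. This is not the authors' model: it assumes a single variable with Gaussian noise of known standard deviation and a wide Gaussian prior on the mean, and the names `log_posterior`, `rwm_sample`, and `impute_missing` are hypothetical. It only shows how posterior draws of a parameter can feed multiple imputations that carry parameter uncertainty.

```python
import math
import random


def log_posterior(mu, data, sigma=1.0, prior_sd=10.0):
    """Unnormalised log posterior of the mean of Gaussian data (assumed toy model)."""
    log_prior = -0.5 * (mu / prior_sd) ** 2
    log_lik = -0.5 * sum(((x - mu) / sigma) ** 2 for x in data)
    return log_prior + log_lik


def rwm_sample(data, n_steps=5000, step_sd=0.5, seed=0):
    """Random Walk Metropolis: propose mu + Gaussian noise, accept with prob min(1, ratio)."""
    rng = random.Random(seed)
    mu = 0.0
    current = log_posterior(mu, data)
    samples = []
    for _ in range(n_steps):
        proposal = mu + rng.gauss(0.0, step_sd)
        candidate = log_posterior(proposal, data)
        # Acceptance probability, clipped at 1 to avoid overflow in exp().
        if rng.random() < math.exp(min(0.0, candidate - current)):
            mu, current = proposal, candidate
        samples.append(mu)
    return samples


def impute_missing(observed, n_draws=5, sigma=1.0, seed=0):
    """Draw several imputations, each from a different posterior sample of the mean."""
    samples = rwm_sample(observed, seed=seed)
    kept = samples[len(samples) // 2:]  # discard the first half as burn-in
    rng = random.Random(seed + 1)
    return [rng.gauss(rng.choice(kept), sigma) for _ in range(n_draws)]


observed = [2.1, 1.9, 2.4, 2.0, 1.6]
print(impute_missing(observed))
```

Each imputed value comes from a different posterior draw, so the spread across the imputations reflects uncertainty in the parameter as well as noise in the data. A gradient-based sampler such as MALA would replace only the proposal step.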

Autonomous Agent-Orchestrated Digital Twins (AADT): Leveraging the OpenClaw Framework for State Synchronization in Rare Genetic Disorders

2603.27104v1 by Hongzhuo Chen, Zhanliang Wang, Quan M. Nguyen, Gongbo Zhang, Chunhua Weng, Kai Wang

Background: Medical Digital Twins (MDTs) are computational representations of individual patients that integrate clinical, genomic, and physiological data to support diagnosis, treatment planning, and outcome prediction. However, most MDTs remain static or passively updated, creating a critical synchronization gap, especially in rare genetic disorders where phenotypes, genomic interpretations, and care guidelines evolve over time. Methods: We propose an agent-orchestrated digital twin framework using OpenClaw's proactive "heartbeat" mechanism and modular Agent Skills. This Autonomous Agent-orchestrated Digital Twin (AADT) system continuously monitors local and external data streams (e.g., patient-reported phenotypes and updates in variant classification databases) and executes automated workflows for data ingestion, normalization, state updates, and trigger-based analysis. Results: A prototype implementation demonstrates that agent orchestration can continuously synchronize MDT states with both longitudinal phenotype updates and evolving genomic knowledge. In rare disease settings, this enables earlier diagnosis and more accurate modeling of disease progression. We present two case studies, including variant reinterpretation and longitudinal phenotype tracking, highlighting how AADTs support timely, auditable updates for both research and clinical care. Conclusion: The AADT framework addresses the key bottleneck of real-time synchronization in MDTs, enabling scalable and continuously updated patient models. We also discuss data security considerations and mitigation strategies through human-in-the-loop system design.

摘要:背景:醫療數位雙胞胎(MDTs)是個別病人的計算表示,整合臨床、基因組和生理數據,以支持診斷、治療計劃和結果預測。然後,大多數MDTs仍然是靜態或被動更新,造成了關鍵的同步差距,特別是在罕見遺傳疾病中,表型、基因組解釋和護理指導隨時間演變。
方法:我們提出了一個代理協調的數位雙胞胎框架,使用OpenClaw的主動“心跳”機制和模組化的代理技能。這個自主代理協調的數位雙胞胎(AADT)系統持續監控本地和外部數據流(例如,病人報告的表型和變異分類數據庫的更新),並執行自動化工作流程以進行數據攝取、標準化、狀態更新和基於觸發的分析。
結果:原型實現展示了代理協調如何持續同步MDT狀態,與縱向表型更新和不斷演變的基因組知識相結合。在罕見疾病環境中,這使得更早的診斷和更準確的疾病進展建模成為可能。我們呈現了兩個案例研究,包括變異重新解釋和縱向表型追蹤,突顯AADTs如何支持及時、可審計的更新,無論是對於研究還是臨床護理。
結論:AADT框架解決了MDTs中實時同步的關鍵瓶頸,使可擴展且持續更新的病人模型成為可能。我們還討論了數據安全考量和通過人機協作系統設計的緩解策略。

When Perplexity Lies: Generation-Focused Distillation of Hybrid Sequence Models

2603.26556v1 by Juan Gabriel Kostelec, Xiang Wang, Axel Laborieux, Christos Sourmpis, Qinghai Guo

Converting a pretrained Transformer into a more efficient hybrid model through distillation offers a promising approach to reducing inference costs. However, achieving high-quality generation in distilled models requires careful joint design of both the student architecture and the distillation process. Many prior distillation works evaluate downstream multiple-choice benchmarks by ranking candidate answers with log-likelihood rather than requiring autoregressive generation, which can obscure important differences in model quality. For example, we show that a 7B parameter distilled model that nearly matches its teacher to within 0.2 pp under log-likelihood scoring actually falls behind by 20.8 pp when the model must generate answers autoregressively. We propose a Hybrid Kimi Delta Attention (Hybrid-KDA) architecture paired with GenDistill, a multi-stage distillation pipeline, and use generation-based evaluation throughout to guide design decisions. Applying this approach to Qwen3-0.6B, we systematically ablate six design axes: training objective, loss masking, training duration, dataset selection, parameter freezing, and architecture choice. We find that log-likelihood-based evaluation consistently underestimates the gap between teacher and student, and can in some cases reverse the ranking of design choices, meaning that conclusions drawn from perplexity-only evaluation may be misleading. Among the factors we study, dataset selection, completion-only masking, and freezing attention layers during post-training have the largest impact on generation quality. Our best Hybrid-KDA model retains 86-90% of teacher accuracy on knowledge benchmarks while reducing KV cache memory by up to 75% and improving time-to-first-token by 2-4x at 128K-token contexts.

摘要:將預訓練的 Transformer 轉換為更高效的混合模型通過蒸餾提供了一種減少推理成本的有前景的方法。然而,在蒸餾模型中實現高質量生成需要學生架構和蒸餾過程的仔細聯合設計。許多先前的蒸餾工作通過使用對數似然對候選答案進行排名來評估下游的多選基準,而不是要求自回歸生成,這可能會掩蓋模型質量的重要差異。例如,我們展示了一個 7B 參數的蒸餾模型,在對數似然評分下幾乎與其教師匹配到 0.2\,pp,但當模型必須自回歸生成答案時,實際上落後 20.8\,pp。我們提出了一種混合 Kimi Delta 注意力(Hybrid-KDA)架構,搭配 GenDistill,一個多階段的蒸餾管道,並在整個過程中使用基於生成的評估來指導設計決策。將這種方法應用於 Qwen3-0.6B,我們系統性地消融了六個設計軸:訓練目標、損失遮蔽、訓練持續時間、數據集選擇、參數凍結和架構選擇。我們發現基於對數似然的評估始終低估了教師和學生之間的差距,在某些情況下甚至會顛倒設計選擇的排名,這意味著僅基於困惑度的評估得出的結論可能會誤導。在我們研究的因素中,數據集選擇、僅完成遮蔽以及在後訓練期間凍結注意力層對生成質量的影響最大。我們最好的 Hybrid-KDA 模型在知識基準上保留了 86--90\% 的教師準確性,同時將 KV 緩存記憶減少了最多 75\%,並在 128K 令牌上下文中將首次令牌的時間提高了 2--4$\times$。

Foundation Model for Cardiac Time Series via Masked Latent Attention

2603.26475v1 by Moritz Vandenhirtz, Samuel Ruipérez-Campillo, Simon Böhi, Sonia Laguna, Irene Cannistraci, Andrea Agostini, Ece Ozkan, Thomas M. Sutter, Julia E. Vogt

Electrocardiograms (ECGs) are among the most widely available clinical signals and play a central role in cardiovascular diagnosis. While recent foundation models (FMs) have shown promise for learning transferable ECG representations, most existing pretraining approaches treat leads as independent channels and fail to explicitly leverage their strong structural redundancy. We introduce the latent attention masked autoencoder (LAMAE) FM that directly exploits this structure by learning cross-lead connection mechanisms during self-supervised pretraining. Our approach models higher-order interactions across leads through latent attention, enabling permutation-invariant aggregation and adaptive weighting of lead-specific representations. We provide empirical evidence on the MIMIC-IV-ECG database that leveraging the cross-lead connection constitutes an effective form of structural supervision, improving representation quality and transferability. Our method shows strong performance in predicting ICD-10 codes, outperforming independent-lead masked modeling and alignment-based baselines.

摘要:心電圖(ECG)是最廣泛可用的臨床信號之一,並在心血管診斷中扮演核心角色。儘管最近的基礎模型(FM)在學習可轉移的ECG表示方面顯示出潛力,但大多數現有的預訓練方法將導聯視為獨立通道,未能明確利用其強大的結構冗餘。我們引入了潛在注意力遮罩自編碼器(LAMAE)FM,通過在自監督預訓練期間學習導聯之間的連接機制,直接利用這一結構。我們的方法通過潛在注意力建模導聯之間的高階交互,實現了排列不變的聚合和導聯特定表示的自適應加權。我們在Mimic-IV-ECG數據庫上提供了實證證據,表明利用導聯之間的連接構成了一種有效的結構監督形式,提高了表示質量和可轉移性。我們的方法在預測ICD-10代碼方面表現出色,超越了獨立導聯遮罩建模和基於對齊的基準。

PRISMA: Toward a Normative Information Infrastructure for Responsible Pharmaceutical Knowledge Management

2603.26324v1 by Eugenio Rodrigo Zimmer Neves, Amanda Vanon Correa, Camila Campioni, Gabielli Pare Guglielmi, Bruno Morelli

Most existing approaches to AI in pharmacy collapse three epistemologically distinct operations into a single technical layer: document preservation, semantic interpretation, and contextual presentation. This conflation is a root cause of recurring fragilities including loss of provenance, interpretive opacity, alert fatigue, and erosion of accountability. This paper proposes the PATOS-Lector-PRISMA (PLP) infrastructure as a normative information architecture for responsible pharmaceutical knowledge management. PATOS preserves regulatory documents with explicit versioning and provenance; Lector implements machine-assisted reading with human curation, producing typed assertions anchored to primary sources; PRISMA delivers contextual presentation through the RPDA framework (Regulatory, Prescription, Dispensing, Administration), refracting the same informational core into distinct professional views. The architecture introduces the Evidence Pack as a formal unit of accountable assertion (versioned, traceable, epistemically bounded, and curatorially validated), with assertions typified by illocutionary force. A worked example traces dipyrone monohydrate across all three layers using real system data. Developed and validated in Brazil's regulatory context, the architecture is grounded in an operational implementation comprising over 16,000 official documents and 38 curated Evidence Packs spanning five reference medications. The proposal is demonstrated as complementary to operational decision support systems, providing infrastructural conditions that current systems lack: documentary anchoring, interpretive transparency, and institutional accountability.

摘要:大多數現有的藥學人工智慧方法將三個在認識論上明顯不同的操作合併為一個單一的技術層:文件保存、語義解釋和上下文呈現。這種混淆是導致反覆出現的脆弱性的根本原因,包括來源丟失、解釋不明、警報疲勞和問責制侵蝕。本文提出PATOS--Lector--PRISMA (PLP) 基礎設施作為負責任的藥學知識管理的規範性信息架構。PATOS以明確的版本控制和來源保存監管文件;Lector實施人機協作的閱讀,產生與主要來源相連的類型化斷言;PRISMA通過RPDA框架(監管、處方、配藥、管理)提供上下文呈現,將相同的信息核心折射成不同的專業視角。該架構引入了證據包作為一個正式的可問責斷言單位(版本化、可追溯、認識論界定且經過策展驗證),其斷言以言外之意的強度為特徵。一個實例追踪了單硫酸二氫鈉在所有三個層面上的應用,使用真實系統數據。在巴西的監管背景下開發和驗證,該架構基於一個操作實施,包含超過16,000份官方文件和38個策展的證據包,涵蓋五種參考藥物。該提案被證明是對操作決策支持系統的補充,提供了當前系統所缺乏的基礎設施條件:文件錨定、解釋透明度和機構問責制。

Progressive Learning with Anatomical Priors for Reliable Left Atrial Scar Segmentation from Late Gadolinium Enhancement MRI

2603.26186v1 by Jing Zhang, Bastien Bergere, Emilie Bollache, Jonas Leite, Mikaël Laredo, Alban Redheuil, Nadjia Kachenoura

Cardiac MRI late gadolinium enhancement (LGE) enables non-invasive identification of left atrial (LA) scar, whose spatial distribution is strongly associated with atrial fibrillation (AF) severity and recurrence. However, automatic LA scar segmentation remains challenging due to low contrast, annotation variability, and the lack of anatomical constraints, often leading to unreliable predictions. Accordingly, our aim was to propose a progressive learning strategy, inspired by a clinical workflow, to segment LA scar from LGE images. A 3-stage framework based on SwinUNETR was implemented, comprising: 1) an LA cavity pre-learning model, 2) a dual-task model that further learns the spatial relationship between LA geometry and scar patterns, and 3) fine-tuning on precise segmentation of the scar. Furthermore, we introduced an anatomy-aware spatially weighted loss that incorporates prior clinical knowledge by constraining scar predictions to anatomically plausible LA wall regions while mitigating annotation bias. In our preliminary results, obtained on validation LGE volumes from the LASCARQS public dataset with 5-fold cross-validation, LA segmentation reached a Dice score of 0.94, and LA scar segmentation achieved a Dice score of 0.50, a Hausdorff Distance of 11.84 mm, and an Average Surface Distance of 1.80 mm, outperforming a one-stage-only scar segmentation (0.49, 13.02 mm, and 1.96 mm, respectively). By explicitly embedding clinical anatomical priors and diagnostic reasoning into deep learning, the proposed approach improved the accuracy and reliability of LA scar segmentation from LGE, revealing the importance of clinically informed model design.

摘要:心臟 MRI 晚期鉺增強 (LGE) 能夠非侵入性地識別左心房 (LA) 瘢痕,其空間分佈與心房顫動 (AF) 的嚴重程度和復發密切相關。然而,由於對比度低、標註變異性以及缺乏解剖約束,自動化的 LA 瘢痕分割仍然具有挑戰性,這常常導致不可靠的預測。因此,我們的目標是提出一種漸進式學習策略,從 LGE 圖像中分割 LA 瘢痕,靈感來自臨床工作流程。我們實現了一個基於 SwinUNETR 的三階段框架,包括:1) 第一個 LA 腔體預學習模型,2) 雙任務模型進一步學習 LA 幾何形狀與瘢痕模式之間的空間關係,以及 3) 對瘢痕的精確分割進行微調。此外,我們引入了一種解剖學意識的空間加權損失,通過將瘢痕預測約束於解剖上合理的 LA 壁區域,同時減輕標註偏差,來融入先前的臨床知識。我們在 LASCARQS 公共數據集中經過 5 倍交叉驗證後獲得的初步結果顯示,LA 分割的 Dice 分數為 0.94,LA 瘢痕分割的 Dice 分數為 0.50,Hausdorff 距離為 11.84 mm,平均表面距離為 1.80 mm,分別優於僅有一階段瘢痕分割的 0.49、13.02 mm 和 1.96 mm。通過明確地將臨床解剖先驗和診斷推理嵌入深度學習中,所提出的方法提高了從 LGE 中進行 LA 瘢痕分割的準確性和可靠性,揭示了臨床知情模型設計的重要性。
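
For readers unfamiliar with the Dice score reported above, here is a minimal sketch of the metric on flattened binary masks. The helper `restrict_to_region` is a hypothetical post-hoc illustration of the general idea of limiting scar predictions to a plausible wall region. It is not the paper's anatomy-aware weighted loss, which acts during training.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


def restrict_to_region(pred, region_mask):
    """Zero out predicted voxels that fall outside an allowed region (hypothetical helper)."""
    return [p * r for p, r in zip(pred, region_mask)]


# Toy 6-voxel example: two predicted scar voxels are false positives.
pred = [1, 1, 0, 1, 1, 0]
target = [1, 0, 0, 0, 1, 0]
wall = [1, 0, 0, 0, 1, 1]  # assumed LA wall region for illustration

print(dice_score(pred, target))
print(dice_score(restrict_to_region(pred, wall), target))
```

In this toy case the two false positives fall outside the assumed wall region, so masking them raises the score from 2/3 to 1.0. A Dice of 0.50 for scar against 0.94 for the LA cavity reflects how much harder the small, low-contrast scar target is.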

SkinGPT-X: A Self-Evolving Collaborative Multi-Agent System for Transparent and Trustworthy Dermatological Diagnosis

2603.26122v1 by Zhangtianyi Chen, Yuhao Shen, Florensia Widjaja, Yan Xu, Liyuan Sun, Zijian Wang, Hongyi Chen, Wufei Dai, Juexiao Zhou

While recent progress in Large Language Models has significantly advanced dermatological diagnosis, monolithic LLMs frequently struggle with fine-grained, large-scale multi-class diagnostic tasks and rare skin disease diagnosis owing to training data sparsity, while also lacking the interpretability and traceability essential for clinical reasoning. Although multi-agent systems can offer more transparent and explainable diagnostics, existing frameworks are primarily concentrated on Visual Question Answering and conversational tasks, and their heavy reliance on static knowledge bases restricts adaptability in complex real-world clinical settings. Here, we present SkinGPT-X, a multimodal collaborative multi-agent system for dermatological diagnosis integrated with a self-evolving dermatological memory mechanism. By simulating the diagnostic workflow of dermatologists and enabling continuous memory evolution, SkinGPT-X delivers transparent and trustworthy diagnostics for the management of complex and rare dermatological cases. To validate the robustness of SkinGPT-X, we design a three-tier comparative experiment. First, we benchmark SkinGPT-X against four state-of-the-art LLMs across four public datasets, demonstrating its state-of-the-art performance with a +9.6% accuracy improvement on DDI31 and +13% weighted F1 gain on Dermnet over the state-of-the-art model. Second, we construct a large-scale multi-class dataset covering 498 distinct dermatological categories to evaluate its fine-grained classification capabilities. Finally, we curate a rare skin disease dataset, the first benchmark to address the scarcity of clinical data on rare skin diseases, containing 564 clinical samples across eight rare dermatological diseases. On this dataset, SkinGPT-X achieves a +9.8% accuracy improvement, a +7.1% weighted F1 improvement, and a +10% Cohen's Kappa improvement.

摘要:雖然近期在大型語言模型方面的進展顯著推進了皮膚科診斷,但單一的 LLM 在細粒度、大規模多類別診斷任務和罕見皮膚疾病診斷方面經常面臨困難,這主要是由於訓練數據的稀疏性,同時也缺乏臨床推理所需的可解釋性和可追溯性。儘管多代理系統可以提供更透明和可解釋的診斷,但現有框架主要集中在視覺問答和對話任務上,並且對靜態知識庫的高度依賴限制了其在複雜現實臨床環境中的適應性。在此,我們提出了 SkinGPT-X,一個多模態協作多代理系統,專為皮膚科診斷而設,並整合了自我演化的皮膚科記憶機制。通過模擬皮膚科醫生的診斷工作流程並實現持續的記憶演變,SkinGPT-X 提供了透明且值得信賴的診斷,以管理複雜和罕見的皮膚科病例。為了驗證 SkinGPT-X 的穩健性,我們設計了一個三級比較實驗。首先,我們將 SkinGPT-X 與四個最先進的 LLM 在四個公共數據集上進行基準測試,顯示其在 DDI31 上的準確率提高了 +9.6%,在 Dermnet 上的加權 F1 分數提高了 +13%。其次,我們構建了一個涵蓋 498 種不同皮膚科類別的大型多類別數據集,以評估其細粒度分類能力。最後,我們整理了罕見皮膚疾病數據集,這是首個針對臨床罕見皮膚疾病稀缺問題的基準,包含 564 份臨床樣本,涵蓋八種罕見皮膚病。在這個數據集上,SkinGPT-X 實現了 +9.8% 的準確率提高,+7.1% 的加權 F1 提高,以及 +10% 的 Cohen's Kappa 提高。

Seeing Like Radiologists: Context- and Gaze-Guided Vision-Language Pretraining for Chest X-rays

2603.26049v1 by Kang Liu, Zhuoqi Ma, Siyu Liang, Yunan Li, Xiyue Gao, Chao Liang, Kun Xie, Qiguang Miao

Despite recent advances in medical vision-language pretraining, existing models still struggle to capture the diagnostic workflow: radiographs are typically treated as context-agnostic images, while radiologists' gaze, a crucial cue for visual reasoning, remains largely underexplored by existing methods. These limitations hinder the modeling of disease-specific patterns and weaken cross-modal alignment. To bridge this gap, we introduce CoGaze, a Context- and Gaze-guided vision-language pretraining framework for chest X-rays. We first propose a context-infused vision encoder that models how radiologists integrate clinical context (including patient history, symptoms, and diagnostic intent) to guide diagnostic reasoning. We then present a multi-level supervision paradigm that (1) enforces intra- and inter-modal semantic alignment through hybrid-positive contrastive learning, (2) injects diagnostic priors via disease-aware cross-modal representation learning, and (3) leverages radiologists' gaze as probabilistic priors to guide attention toward diagnostically salient regions. Extensive experiments demonstrate that CoGaze consistently outperforms state-of-the-art methods across diverse tasks, achieving up to +2.0% CheXbertF1 and +1.2% BLEU2 for free-text and structured report generation, +23.2% AUROC for zero-shot classification, and +12.2% Precision@1 for image-text retrieval. Code is available at https://github.com/mk-runner/CoGaze.

摘要:儘管最近在醫學視覺-語言預訓練方面取得了進展,但現有模型仍然難以捕捉診斷工作流程:放射線片通常被視為與上下文無關的圖像,而放射科醫師的注視—這一對視覺推理至關重要的線索—在現有方法中仍然大多未被探索。這些限制阻礙了對特定疾病模式的建模,並削弱了跨模態的對齊。為了填補這一空白,我們提出了CoGaze,一種針對胸部X光片的上下文和注視引導的視覺-語言預訓練框架。我們首先提出了一種上下文融入的視覺編碼器,該編碼器建模放射科醫師如何整合臨床上下文—包括病史、症狀和診斷意圖—以指導診斷推理。然後,我們提出了一種多層次的監督範式,該範式 (1) 通過混合正對比學習強化內部和跨模態的語義對齊,(2) 通過疾病感知的跨模態表示學習注入診斷先驗,(3) 利用放射科醫師的注視作為概率先驗,引導注意力集中在診斷上重要的區域。大量實驗表明,CoGaze在各種任務中始終優於最先進的方法,在自由文本和結構化報告生成中達到高達 +2.0% 的 CheXbertF1 和 +1.2% 的 BLEU2,在零樣本分類中達到 +23.2% 的 AUROC,以及在圖像-文本檢索中達到 +12.2% 的 Precision@1。代碼可在 https://github.com/mk-runner/CoGaze 獲得。

Unlabeled Cross-Center Automatic Analysis for TAAD: An Integrated Framework from Segmentation to Clinical Features

2603.26019v1 by Mengdi Liu, Qiang Li, Weizhi Nie, Shaopeng Zhang, Yuting Su

Type A Aortic Dissection (TAAD) is a life-threatening cardiovascular emergency that demands rapid and precise preoperative evaluation. While key anatomical and pathological features are decisive for surgical planning, current research focuses predominantly on improving segmentation accuracy, leaving the reliable, quantitative extraction of clinically actionable features largely under-explored. Furthermore, constructing comprehensive TAAD datasets requires labor-intensive, expert-level pixel-wise annotation, which is impractical for most clinical institutions. Due to significant domain shift, models trained on a single-center dataset also suffer from severe performance degradation during cross-institutional deployment. This study addresses a clinically critical challenge: the accurate extraction of key TAAD clinical features during cross-institutional deployment in the total absence of target-domain annotations. To this end, we propose an unsupervised domain adaptation (UDA)-driven framework for the automated extraction of TAAD clinical features. The framework leverages limited source-domain labels while effectively adapting to unlabeled data from target domains. Tailored for real-world emergency workflows, our framework aims to achieve stable cross-institutional multi-class segmentation, reliable and quantifiable clinical feature extraction, and practical deployability independent of high-cost annotations. Extensive experiments demonstrate that our method significantly improves cross-domain segmentation performance compared to existing state-of-the-art approaches. More importantly, a reader study involving multiple cardiovascular surgeons confirms that the automatically extracted clinical features provide meaningful assistance for preoperative assessment, highlighting the practical utility of the proposed end-to-end segmentation-to-feature pipeline.

摘要:型 A 主動脈剝離 (TAAD) 是一種危及生命的心血管緊急情況,要求快速且精確的術前評估。雖然關鍵的解剖和病理特徵對於手術規劃至關重要,但當前的研究主要集中在提高分割準確性上,導致臨床可行特徵的可靠、定量提取仍然未被充分探索。此外,構建全面的 TAAD 數據集需要耗時的專家級像素級註釋,這對於大多數臨床機構來說是不切實際的。由於顯著的領域轉移,基於單一中心數據集訓練的模型在跨機構部署期間也會遭遇嚴重的性能下降。本研究針對一個臨床關鍵挑戰:在完全缺乏目標域註釋的情況下,準確提取關鍵的 TAAD 臨床特徵。為此,我們提出了一個基於無監督領域適應 (UDA) 的框架,用於自動提取 TAAD 臨床特徵。該框架利用有限的源域標籤,同時有效地適應來自目標域的未標記數據。我們的框架專為現實世界的緊急工作流程量身定制,旨在實現穩定的跨機構多類別分割、可靠且可量化的臨床特徵提取,以及獨立於高成本註釋的實用部署。廣泛的實驗表明,我們的方法在跨域分割性能上顯著優於現有的最先進方法。更重要的是,涉及多位心血管外科醫生的讀者研究確認,自動提取的臨床特徵對術前評估提供了有意義的幫助,突顯了所提議的端到端分割到特徵管道的實用性。

FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for Large Vision-Language Assistants

2603.26008v1 by Mahesh Bhosale, Abdul Wasi, Shantam Srivastava, Shifa Latif, Tianyu Luan, Mingchen Gao, David Doermann, Xuan Gong

While powerful in image-conditioned generation, multimodal large language models (MLLMs) can display uneven performance across demographic groups, highlighting fairness risks. In safety-critical clinical settings, such disparities risk producing unequal diagnostic narratives and eroding trust in AI-assisted decision-making. While fairness has been studied extensively in vision-only and language-only models, its impact on MLLMs remains largely underexplored. To address these biases, we introduce FairLLaVA, a parameter-efficient fine-tuning method that mitigates group disparities in visual instruction tuning without compromising overall performance. By minimizing the mutual information between target attributes, FairLLaVA regularizes the model's representations to be demographic-invariant. The method can be incorporated as a lightweight plug-in, maintaining efficiency with low-rank adapter fine-tuning, and provides an architecture-agnostic approach to fair visual instruction following. Extensive experiments on large-scale chest radiology report generation and dermoscopy visual question answering benchmarks show that FairLLaVA consistently reduces inter-group disparities while improving both equity-scaled clinical performance and natural language generation quality across diverse medical imaging modalities. Code can be accessed at https://github.com/bhosalems/FairLLaVA.

摘要:雖然在圖像條件生成方面具有強大能力,多模態大型語言模型(MLLMs)在不同人口群體之間的表現卻可能不均衡,凸顯了公平風險。在安全至關重要的臨床環境中,這種差異可能導致不平等的診斷敘事,並侵蝕對AI輔助決策的信任。儘管公平性在僅限於視覺和僅限於語言的模型中已被廣泛研究,但其對MLLMs的影響仍然大多未被探討。為了解決這些偏見,我們引入了FairLLaVA,一種參數高效的微調方法,能在不妥協整體性能的情況下減輕視覺指令調整中的群體差異。通過最小化目標屬性之間的互信息,FairLLaVA使模型的表示變得與人口統計無關。該方法可以作為輕量級插件納入,保持低秩適配器微調的效率,並提供一種與架構無關的公平視覺指令跟隨方法。在大規模胸部放射報告生成和皮膚鏡視覺問題回答基準上的廣泛實驗表明,FairLLaVA持續減少群體間的差異,同時提高各種醫學影像模式下的公平性縮放臨床表現和自然語言生成質量。代碼可在 https://github.com/bhosalems/FairLLaVA 獲取。

Longitudinal Boundary Sharpness Coefficient Slopes Predict Time to Alzheimer's Disease Conversion in Mild Cognitive Impairment: A Survival Analysis Using the ADNI Cohort

2603.26007v1 by Ishaan Cherukuri

Predicting whether someone with mild cognitive impairment (MCI) will progress to Alzheimer's disease (AD) is crucial in the early stages of neurodegeneration. This uncertainty limits enrollment in clinical trials and delays urgent treatment. The Boundary Sharpness Coefficient (BSC) measures how well-defined the gray-white matter boundary looks on structural MRI. This study shows that measuring how BSC changes over time, namely how fast the boundary degrades each year, works much better than looking at a single baseline scan for predicting MCI-to-AD conversion. This study analyzed 1,824 T1-weighted MRI scans from 450 ADNI subjects (95 converters, 355 stable; mean follow-up: 4.84 years). BSC voxel-wise maps were computed using tissue segmentation at the gray-white matter cortical ribbon. Previous studies have used CNN and RNN models that reached 96.0% accuracy for AD classification and 84.2% for MCI conversion, but those approaches disregard specific regions within the brain. This study focused specifically on the gray-white matter interface. The approach uses temporal slope features capturing boundary degradation rates, feeding them into Random Survival Forest, a non-parametric ensemble method for right-censored survival data. The Random Survival Forest trained on BSC slopes achieved a test C-index of 0.63, a 163% improvement over baseline parametric models (test C-index: 0.24). Structural MRI costs a fraction of PET imaging ($800-$1,500 vs. $5,000-$7,000) and does not require CSF collection. These temporal biomarkers could help with patient-centered safety screening as well as risk assessment.

摘要:預測輕度認知障礙(MCI)患者是否會進展為阿茲海默症(AD)在神經退行性疾病的早期階段至關重要。這種不確定性限制了臨床試驗的入組並延遲了緊急治療。邊界銳利度係數(BSC)衡量結構性MRI中灰白質邊界的清晰程度。這項研究測量了BSC隨時間的變化,即邊界每年退化的速度,這比僅僅查看單一基線掃描在預測MCI轉換為AD方面要好得多。這項研究分析了來自450名ADNI受試者的1,824個T1加權MRI掃描(95名轉換者,355名穩定者;平均隨訪:4.84年)。BSC體素級地圖是通過在灰白質皮質帶進行組織分割計算得出的。先前的研究使用了CNN和RNN模型,達到了96.0%的AD分類準確率和84.2%的MCI轉換準確率,但這些方法忽略了大腦內的特定區域。這項研究特別關注灰白質界面。該方法使用捕捉邊界退化速率的時間斜率特徵,並將其輸入隨機生存森林(Random Survival Forest),這是一種針對右截尾生存數據的非參數集成方法。基於BSC斜率訓練的隨機生存森林達到了0.63的測試C指數,較基線參數模型(測試C指數:0.24)提高了163%。結構性MRI的成本僅為PET成像的一小部分($800--$1,500對比$5,000--$7,000),且不需要收集腦脊液。這些時間生物標記可能有助於以患者為中心的安全篩查以及風險評估。
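
The two quantities this abstract relies on, a per-subject temporal slope and the concordance index (C-index), can be sketched in a few lines. The BSC values, follow-up times, and subject labels below are invented for illustration, and the slope itself is used as the risk score in place of a Random Survival Forest, so this shows only how the features and the metric are defined, not the study's pipeline.

```python
def ols_slope(times, values):
    """Least-squares slope of values against times (change in value per year)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    den = sum((t - mean_t) ** 2 for t in times)
    return num / den


def concordance_index(risk, time, event):
    """Harrell's C: share of comparable pairs where the higher-risk subject converts first."""
    concordant = 0.0
    comparable = 0
    for i in range(len(risk)):
        for j in range(len(risk)):
            # Comparable only if subject i has an observed event before subject j's time.
            if event[i] and time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical subjects: (scan times in years, BSC values), follow-up time, converted?
scans = {
    "A": ([0, 1, 2], [1.00, 0.90, 0.80]),
    "B": ([0, 1, 2], [1.00, 0.97, 0.94]),
    "C": ([0, 2, 4], [1.00, 0.98, 0.96]),
    "D": ([0, 1, 3], [1.00, 0.95, 0.85]),
}
follow_up = {"A": 2.0, "B": 3.0, "C": 6.0, "D": 4.0}
converted = {"A": 1, "B": 1, "C": 0, "D": 1}

ids = sorted(scans)
risk = [-ols_slope(*scans[s]) for s in ids]  # faster decline means higher risk
c_index = concordance_index(risk, [follow_up[s] for s in ids], [converted[s] for s in ids])
print(c_index)
```

A C-index of 0.5 corresponds to random ordering and 1.0 to perfect ordering. The relative improvement quoted in the abstract follows from (0.63 − 0.24) / 0.24 ≈ 1.625, that is, roughly 163%.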

When Chain-of-Thought Backfires: Evaluating Prompt Sensitivity in Medical Language Models

2603.25960v1 by Binesh Sadanandan, Vahid Behzadan

Large Language Models (LLMs) are increasingly deployed in medical settings, yet their sensitivity to prompt formatting remains poorly characterized. We evaluate MedGemma (4B and 27B parameters) on MedMCQA (4,183 questions) and PubMedQA (1,000 questions) across a broad suite of robustness tests. Our experiments reveal several concerning findings. Chain-of-Thought (CoT) prompting decreases accuracy by 5.7% compared to direct answering. Few-shot examples degrade performance by 11.9% while increasing position bias from 0.14 to 0.47. Shuffling answer options causes the model to change predictions 59.1% of the time, with accuracy dropping up to 27.4 percentage points. Front-truncating context to 50% causes accuracy to plummet below the no-context baseline, yet back-truncation preserves 97% of full-context accuracy. We further show that cloze scoring (selecting the highest log-probability option token) achieves 51.8% (4B) and 64.5% (27B), surpassing all prompting strategies and revealing that models "know" more than their generated text shows. Permutation voting recovers 4 percentage points over single-ordering inference. These results demonstrate that prompt engineering techniques validated on general-purpose models do not transfer to domain-specific medical LLMs, and that reliable alternatives exist.

摘要:大型語言模型(LLMs)在醫療環境中的應用日益增多,但它們對提示格式的敏感性仍然缺乏充分的特徵描述。我們在一系列穩健性測試中評估了 MedGemma(4B 和 27B 參數)在 MedMCQA(4,183 題問題)和 PubMedQA(1,000 題問題)上的表現。我們的實驗揭示了幾個令人擔憂的發現。思維鏈(CoT)提示的準確性比直接回答降低了 5.7%。少量示例使性能下降了 11.9%,同時將位置偏見從 0.14 增加到 0.47。打亂答案選項導致模型在 59.1% 的情況下改變預測,準確率下降最多達 27.4 個百分點。將上下文前置截斷到 50% 使準確率跌至低於無上下文基準,而後置截斷則保留了 97% 的全上下文準確率。我們進一步顯示,填空評分(選擇最高對數概率選項標記)在 4B 中達到 51.8% 和在 27B 中達到 64.5%,超越了所有提示策略,並揭示模型“知道”的信息超過其生成文本所顯示的。排列投票在單次排序推理上恢復了 4 個百分點。這些結果表明,針對通用模型驗證的提示工程技術並不適用於特定領域的醫療 LLM,並且存在可靠的替代方案。
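
Permutation voting, mentioned at the end of this abstract, can be sketched as follows. The exact protocol used in the paper is not given here, so this is one plausible reading: query the model under several option orderings and take a majority vote on the chosen option's text. `toy_position_biased_model` is a hypothetical stand-in for an LLM call that answers correctly unless the correct option is listed last.

```python
import itertools
from collections import Counter


def permutation_vote(question, options, answer_fn, max_perms=24):
    """Ask the same question under several option orderings and majority-vote on content."""
    votes = Counter()
    for perm in itertools.islice(itertools.permutations(options), max_perms):
        picked_index = answer_fn(question, list(perm))  # index into the shuffled list
        votes[perm[picked_index]] += 1                  # map the index back to option text
    winner = votes.most_common(1)[0][0]
    return winner, votes


def toy_position_biased_model(question, shuffled_options, correct="aspirin"):
    """Hypothetical model: right answer unless it sits in the last slot, then picks slot 0."""
    idx = shuffled_options.index(correct)
    return 0 if idx == len(shuffled_options) - 1 else idx


options = ["aspirin", "ibuprofen", "warfarin", "heparin"]
winner, votes = permutation_vote("Which drug ...?", options, toy_position_biased_model)
print(winner, dict(votes))
```

With four options there are 24 orderings. The toy model is wrong in the 6 where the correct option comes last, so the vote still recovers the correct answer with 18 of 24 votes. The cost is one model call per ordering, which is why `max_perms` caps the number of permutations.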

Methods for Knowledge Graph Construction from Text Collections: Development and Applications

2603.25862v1 by Vanni Zavarella

Virtually every sector of society is experiencing a dramatic growth in the volume of unstructured textual data that is generated and published, from news and online social media interactions to open-access scholarly communications and observational data in the form of digital health records and online drug reviews. The volume and variety of data across this range of domains has created both unprecedented opportunities and pressing challenges for extracting actionable knowledge for several application scenarios. However, the extraction of rich semantic knowledge demands the deployment of scalable and flexible automatic methods adaptable across text genres and schema specifications. Moreover, the full potential of these data can only be unlocked by coupling information extraction methods with Semantic Web techniques for the construction of full-fledged Knowledge Graphs that are semantically transparent, explainable by design, and interoperable. In this thesis, we experiment with the application of Natural Language Processing, Machine Learning and Generative AI methods, powered by Semantic Web best practices, to the automatic construction of Knowledge Graphs from large text corpora, in three use case applications: the analysis of the Digital Transformation discourse in the global news and social media platforms; the mapping and trend analysis of recent research in the Architecture, Engineering, Construction and Operations domain from a large corpus of publications; and the generation of causal relation graphs of biomedical entities from electronic health records and patient-authored drug reviews. The contributions of this thesis to the research community are in terms of benchmark evaluation results, the design of customized algorithms and the creation of data resources in the form of Knowledge Graphs, together with data analysis results built on top of them.

Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI

2603.25821v1 by Anna Kozlova, Stanislau Salavei, Pavel Satalkin, Hanna Plotnitskaya, Sergey Parfenyuk

We present Doctorina MedBench, a comprehensive evaluation framework for agent-based medical AI based on the simulation of realistic physician-patient interactions. Unlike traditional medical benchmarks that rely on solving standardized test questions, the proposed approach models a multi-step clinical dialogue in which either a physician or an AI system must collect medical history, analyze attached materials (including laboratory reports, images, and medical documents), formulate differential diagnoses, and provide personalized recommendations. System performance is evaluated using the D.O.T.S. metric, which consists of four components: Diagnosis, Observations/Investigations, Treatment, and Step Count, enabling assessment of both clinical correctness and dialogue efficiency. The system also incorporates a multi-level testing and quality monitoring architecture designed to detect model degradation during both development and deployment. The framework supports safety-oriented trap cases, category-based random sampling of clinical scenarios, and full regression testing. The dataset currently contains more than 1,000 clinical cases covering over 750 diagnoses. The universality of the evaluation metrics allows the framework to be used not only to assess medical AI systems, but also to evaluate physicians and support the development of clinical reasoning skills. Our results suggest that simulation of clinical dialogue may provide a more realistic assessment of clinical competence compared to traditional examination-style benchmarks.

Beyond identifiability: Learning causal representations with few environments and finite samples

2603.25796v1 by Inbeom Lee, Tongtong Jin, Bryon Aragam

We provide explicit, finite-sample guarantees for learning causal representations from data with a sublinear number of environments. Causal representation learning seeks to provide a rigorous foundation for the general representation learning problem by bridging causal models with latent factor models in order to learn interpretable representations with causal semantics. Despite a blossoming theory of identifiability in causal representation learning, estimation and finite-sample bounds are less well understood. We show that causal representations can be learned with only a logarithmic number of unknown, multi-node interventions, and that the intervention targets need not be carefully designed in advance. Through a careful perturbation analysis, we provide a new analysis of this problem that guarantees consistent recovery of (a) the latent causal graph, (b) the mixing matrix and representations, and (c) unknown intervention targets.

DeepFAN, a transformer-based deep learning model for human-artificial intelligence collaborative assessment of incidental pulmonary nodules in CT scans: a multi-reader, multi-case trial

2603.25607v1 by Zhenchen Zhu, Ge Hu, Weixiong Tan, Kai Gao, Chao Sun, Zhen Zhou, Kepei Xu, Wei Han, Meixia Shang, Xiaoming Qiu, Yiqing Tan, Jinhua Wang, Zhoumeng Ying, Li Peng, Wei Song, Lan Song, Zhengyu Jin, Nan Hong, Yizhou Yu

The widespread adoption of CT has notably increased the number of detected lung nodules. However, current deep learning methods for classifying benign and malignant nodules often fail to comprehensively integrate global and local features, and most of them have not been validated through clinical trials. To address this, we developed DeepFAN, a transformer-based model trained on over 10K pathology-confirmed nodules, and further conducted a multi-reader, multi-case clinical trial to evaluate its efficacy in assisting junior radiologists. DeepFAN achieved a diagnostic area under the curve (AUC) of 0.939 (95% CI 0.930-0.948) on an internal test set and 0.954 (95% CI 0.934-0.973) on the clinical trial dataset involving 400 cases across three independent medical institutions. Explainability analysis indicated higher contributions from global than local features. Twelve readers' average performance significantly improved by 10.9% (95% CI 8.3%-13.5%) in AUC, 10.0% (95% CI 8.9%-11.1%) in accuracy, 7.6% (95% CI 6.1%-9.2%) in sensitivity, and 12.6% (95% CI 10.9%-14.3%) in specificity (P<0.001 for all). Nodule-level inter-reader diagnostic consistency improved from fair to moderate (overall κ: 0.313 vs. 0.421; P=0.019). In conclusion, DeepFAN effectively assisted junior radiologists and may help homogenize diagnostic quality and reduce unnecessary follow-up of indeterminate pulmonary nodules. Chinese Clinical Trial Registry: ChiCTR2400084624.
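The "fair to moderate" wording for inter-reader consistency matches the commonly used Landis and Koch bands for kappa statistics. The sketch below is only illustrative: it computes Cohen's kappa for two readers (the abstract reports an overall value across twelve readers and does not say which multi-rater statistic was used), and the function names `cohen_kappa` and `landis_koch` are hypothetical.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(a)
    # Fraction of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[label] * cb[label] for label in ca) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1.0 - expected)


def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch agreement band."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# The two overall values reported in the abstract fall in adjacent bands.
print(landis_koch(0.313), landis_koch(0.421))
```

Under these bands, 0.313 is "fair" and 0.421 is "moderate", which is consistent with the abstract's description.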

Interpretable PM2.5 Forecasting for Urban Air Quality: A Comparative Study of Operational Time-Series Models

2603.25495v1 by Moazzam Umer Gondal, Hamad ul Qudous, Asma Ahmad Farhan, Sultan Alamri

Accurate short-term air-quality forecasting is essential for public health protection and urban management, yet many recent forecasting frameworks rely on complex, data-intensive, and computationally demanding models. This study investigates whether lightweight and interpretable forecasting approaches can provide competitive performance for hourly PM2.5 prediction in Beijing, China. Using multi-year pollutant and meteorological time-series data, we developed a leakage-aware forecasting workflow that combined chronological data partitioning, preprocessing, feature selection, and exogenous-driver modeling under the Perfect Prognosis setting. Three forecasting families were evaluated: SARIMAX, Facebook Prophet, and NeuralProphet. To assess practical deployment behavior, the models were tested under two adaptive regimes: weekly walk-forward refitting and frozen forecasting with online residual correction. Results showed clear differences in both predictive accuracy and computational efficiency. Under walk-forward refitting, Facebook Prophet achieved the strongest completed performance, with an MAE of 37.61 and an RMSE of 50.10, while also requiring substantially less execution time than NeuralProphet. In the frozen-model regime, online residual correction improved Facebook Prophet and SARIMAX, with corrected SARIMAX yielding the lowest overall error (MAE 32.50; RMSE 46.85). NeuralProphet remained less accurate and less stable across both regimes, and residual correction did not improve its forecasts. Notably, corrected Facebook Prophet reached nearly the same error as its walk-forward counterpart while reducing runtime from 15 min 21.91 s to 46.60 s. These findings show that lightweight additive forecasting strategies can remain highly competitive for urban air-quality prediction, offering a practical balance between accuracy, interpretability, ...
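The two adaptive regimes named in the abstract can be contrasted with a toy sketch. Everything below is an assumption made for illustration: the "model" is just the mean of the training history (not SARIMAX or Prophet), the residual correction is a simple exponential smoothing of past errors (the abstract does not specify the correction rule), and all function names are hypothetical. Both regimes only use observations that precede the time step being predicted, which is the leakage-aware part.

```python
import math


def mae(y, yhat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)


def rmse(y, yhat):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))


def fit_mean(history):
    """Toy stand-in for model fitting: the mean of the history."""
    return sum(history) / len(history)


def walk_forward(series, n_train, block):
    """Refit on all data seen so far before forecasting each block."""
    preds = []
    for start in range(n_train, len(series), block):
        level = fit_mean(series[:start])
        stop = min(start + block, len(series))
        preds.extend([level] * (stop - start))
    return preds


def frozen_with_residual_correction(series, n_train, alpha=0.5):
    """Fit once, then add a smoothed estimate of recent residuals."""
    level = fit_mean(series[:n_train])
    bias = 0.0
    preds = []
    for t in range(n_train, len(series)):
        preds.append(level + bias)
        # The true value becomes available only after the forecast is issued.
        residual = series[t] - level
        bias = alpha * residual + (1 - alpha) * bias
    return preds


series = [10.0] * 4 + [20.0] * 4  # level shift after the training window
actual = series[4:]
wf = walk_forward(series, n_train=4, block=2)
fz = frozen_with_residual_correction(series, n_train=4)
print("walk-forward MAE:", mae(actual, wf), "RMSE:", rmse(actual, wf))
print("frozen+corrected MAE:", mae(actual, fz), "RMSE:", rmse(actual, fz))
```

The frozen regime fits the model once and performs only a constant-time update per step, which is one plausible reason a corrected frozen model can run far faster than repeated refitting.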

Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models

2603.25403v2 by Eyal Hadad, Mordechai Guri

On-device Vision-Language Models (VLMs) promise data privacy via local execution. However, we show that the architectural shift toward Dynamic High-Resolution preprocessing (e.g., AnyRes) introduces an inherent algorithmic side-channel. Unlike static models, dynamic preprocessing decomposes images into a variable number of patches based on their aspect ratio, creating workload-dependent inputs. We demonstrate a dual-layer attack framework against local VLMs. In Tier 1, an unprivileged attacker can exploit significant execution-time variations using standard unprivileged OS metrics to reliably fingerprint the input's geometry. In Tier 2, by profiling Last-Level Cache (LLC) contention, the attacker can resolve semantic ambiguity within identical geometries, distinguishing between visually dense (e.g., medical X-rays) and sparse (e.g., text documents) content. By evaluating state-of-the-art models such as LLaVA-NeXT and Qwen2-VL, we show that combining these signals enables reliable inference of privacy-sensitive contexts. Finally, we analyze the security engineering trade-offs of mitigating this vulnerability, reveal substantial performance overhead with constant-work padding, and propose practical design recommendations for secure Edge AI deployments.
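To see why dynamic high-resolution preprocessing makes the workload depend on image geometry, consider the following simplified sketch. It is not the preprocessing code of any particular model: the tile size, the candidate resolutions in `CANDIDATES`, and the selection rule (prefer the candidate that preserves the most image area, then the one that wastes the least canvas) are assumptions chosen for illustration.

```python
TILE = 336  # assumed tile edge length in pixels
# Hypothetical candidate canvases as (width, height), multiples of TILE.
CANDIDATES = [(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)]


def select_resolution(width, height, candidates=CANDIDATES):
    """Pick the canvas that keeps the most image area, then wastes the least."""
    best, best_key = None, None
    for cw, ch in candidates:
        # Fit the image inside the canvas, preserving aspect ratio
        # (integer arithmetic avoids floating-point rounding surprises).
        if cw * height <= ch * width:
            scaled_w, scaled_h = cw, height * cw // width
        else:
            scaled_w, scaled_h = width * ch // height, ch
        effective = min(scaled_w * scaled_h, width * height)
        wasted = cw * ch - effective
        key = (effective, -wasted)
        if best_key is None or key > best_key:
            best, best_key = (cw, ch), key
    return best


def tile_count(width, height):
    """Tiles fed to the vision encoder: grid tiles plus one global view."""
    cw, ch = select_resolution(width, height)
    return (cw // TILE) * (ch // TILE) + 1


def constant_work_tile_count(candidates=CANDIDATES):
    """Constant-work padding: always process the worst-case number of tiles."""
    return max((cw // TILE) * (ch // TILE) for cw, ch in candidates) + 1


for size in [(1000, 1000), (3000, 1000), (1000, 3000)]:
    print(size, tile_count(*size))
```

In this sketch a square image yields 5 tiles while a 3:1 or 1:3 image yields 4, so the encoder does a different amount of work per shape. Padding every input to `constant_work_tile_count()` removes that variation at the price of always paying the worst-case cost, which corresponds to the overhead trade-off the abstract mentions.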

A Causal Framework for Evaluating ICU Discharge Strategies

2603.25397v1 by Sagar Nagaraj Simha, Juliette Ortholand, Dave Dongelmans, Jessica D. Workum, Olivier W. M. Thijssens, Ameen Abu-Hanna, Giovanni Cinà

In this applied paper, we address the difficult open problem of when to discharge patients from the Intensive Care Unit. This can be conceived as an optimal stopping scenario with three added challenges: 1) the evaluation of a stopping strategy from observational data is itself a complex causal inference problem, 2) the composite objective is to minimize the length of intervention and maximize the outcome, but the two cannot be collapsed to a single dimension, and 3) the recording of variables stops when the intervention is discontinued. Our contributions are two-fold. First, we generalize the implementation of the g-formula Python package, providing a framework to evaluate stopping strategies for problems with the aforementioned structure, including positivity and coverage checks. Second, with a fully open-source pipeline, we apply this approach to MIMIC-IV, a public ICU dataset, demonstrating the potential for strategies that improve upon current care.

Evaluating Language Models for Harmful Manipulation

2603.25326v2 by Canfer Akbulut, Rasmi Elasmar, Abhishek Roy, Anthony Payne, Priyanka Suresh, Lujain Ibrahim, Seliem El-Sayed, Charvi Rastogi, Ashyana Kachra, Will Hawkins, Kristian Lum, Laura Weidinger

Interest in the concept of AI-driven harmful manipulation is growing, yet current approaches to evaluating it are limited. This paper introduces a framework for evaluating harmful AI manipulation via context-specific human-AI interaction studies. We illustrate the utility of this framework by assessing an AI model with 10,101 participants spanning interactions in three AI use domains (public policy, finance, and health) and three locales (US, UK, and India). Overall, we find that the tested model can produce manipulative behaviours when prompted to do so and, in experimental settings, is able to induce belief and behaviour changes in study participants. We further find that context matters: AI manipulation differs between domains, suggesting that it needs to be evaluated in the high-stakes context(s) in which an AI system is likely to be used. We also identify significant differences across our tested geographies, suggesting that AI manipulation results from one geographic region may not generalise to others. Finally, we find that the frequency of manipulative behaviours (propensity) of an AI model is not consistently predictive of the likelihood of manipulative success (efficacy), underscoring the importance of studying these dimensions separately. To facilitate adoption of our evaluation framework, we detail our testing protocols and make relevant materials publicly available. We conclude by discussing open challenges in evaluating harmful manipulation by AI models.

AD-CARE: A Guideline-grounded, Modality-agnostic LLM Agent for Real-world Alzheimer's Disease Diagnosis with Multi-cohort Assessment, Fairness Analysis, and Reader Study

2603.25322v1 by Wenlong Hou, Sheng Bi, Guangqian Yang, Lihao Liu, Ye Du, Hanxiao Xue, Juncheng Wang, Yuxiang Feng, Yue Xun, Nanxi Yu, Ning Mao, Mo Yang, Yi Wah Eva Cheung, Ling Long, Kay Chen Tan, Lequan Yu, Xiaomeng Ma, Shaozhen Yan, Shujun Wang

Alzheimer's disease (AD) is a growing global health challenge as populations age, and timely, accurate diagnosis is essential to reduce individual and societal burden. However, real-world AD assessment is hampered by incomplete, heterogeneous multimodal data and variability across sites and patient demographics. Although large language models (LLMs) have shown promise in biomedicine, their use in AD has largely been confined to answering narrow, disease-specific questions rather than generating comprehensive diagnostic reports that support clinical decision-making. Here we expand LLM capabilities for clinical decision support by introducing AD-CARE, a modality-agnostic agent that performs guideline-grounded diagnostic assessment from incomplete, heterogeneous inputs without imputing missing modalities. By dynamically orchestrating specialized diagnostic tools and embedding clinical guidelines into LLM-driven reasoning, AD-CARE generates transparent, report-style outputs aligned with real-world clinical workflows. Across six cohorts comprising 10,303 cases, AD-CARE achieved 84.9% diagnostic accuracy, delivering 4.2%-13.7% relative improvements over baseline methods. Despite cohort-level differences, dataset-specific accuracies remained robust (80.4%-98.8%), and the agent consistently outperformed all baselines. AD-CARE reduced performance disparities across racial and age subgroups, decreasing the average dispersion of four metrics by 21%-68% and 28%-51%, respectively. In a controlled reader study, the agent improved neurologist and radiologist accuracy by 6%-11% and more than halved decision time. The framework yielded 2.29%-10.66% absolute gains over eight backbone LLMs and made their performance converge. These results show that AD-CARE is a scalable, practically deployable framework that can be integrated into routine clinical workflows for multimodal decision support in AD.

A Gait Foundation Model Predicts Multi-System Health Phenotypes from 3D Skeletal Motion

2603.25283v1 by Adam Gabet, Sarah Kohn, Guy Lutsker, Shira Gelman, Anastasia Godneva, Gil Sasson, Arad Zulti, David Krongauz, Rotem Shaulitch, Assaf Rotem, Ohad Doron, Yuval Brodsky, Adina Weinberger, Eran Segal

Gait is increasingly recognized as a vital sign, yet current approaches treat it as a symptom of specific pathologies rather than a systemic biomarker. We developed a gait foundation model for 3D skeletal motion from 3,414 deeply phenotyped adults, recorded via a depth camera during five motor tasks. Learned embeddings outperformed engineered features, predicting age (Pearson r = 0.69), BMI (r = 0.90), and visceral adipose tissue area (r = 0.82). Embeddings significantly predicted 1,980 of 3,210 phenotypic targets; after adjustment for age, BMI, VAT, and height, gait provided independent gains in all 18 body systems in males and 17 of 18 in females, and improved prediction of clinical diagnoses and medication use. Anatomical ablation revealed that legs dominated metabolic and frailty predictions while torso encoded sleep and lifestyle phenotypes. These findings establish gait as an independent multi-system biosignal, motivating translation to consumer-grade video and its integration as a scalable, passive vital sign.

A Decade-Scale Benchmark Evaluating LLMs' Clinical Practice Guidelines Detection and Adherence in Multi-turn Conversations

2603.25196v1 by Andong Tan, Shuyu Dai, Jinglu Wang, Fengtao Zhou, Yan Lu, Xi Wang, Yingcong Chen, Can Yang, Shujie Liu, Hao Chen

Clinical practice guidelines (CPGs) play a pivotal role in ensuring evidence-based decision-making and improving patient outcomes. While Large Language Models (LLMs) are increasingly deployed in healthcare scenarios, it is unclear to what extent LLMs can identify and adhere to CPGs during conversations. To address this gap, we introduce CPGBench, an automated framework benchmarking the clinical guideline detection and adherence capabilities of LLMs in multi-turn conversations. We collect 3,418 CPG documents from 9 countries/regions and 2 international organizations, published in the last decade and spanning 24 specialties. From these documents, we extract 32,155 clinical recommendations with the corresponding publication institute, date, country, specialty, recommendation strength, evidence level, etc. One multi-turn conversation is generated for each recommendation to evaluate the detection and adherence capabilities of 8 leading LLMs. We find that 71.1%-89.6% of the recommendations can be correctly detected, while only 3.6%-29.7% of the corresponding titles can be correctly referenced, revealing the gap between knowing the guideline contents and knowing where they come from. The adherence rates range from 21.8% to 63.2% across models, indicating a large gap between knowing the guidelines and being able to apply them. To confirm the validity of our automatic analysis, we further conduct a comprehensive human evaluation involving 56 clinicians from different specialties. To our knowledge, CPGBench is the first benchmark systematically revealing which clinical recommendations LLMs fail to detect or adhere to during conversations. Given that each clinical recommendation may affect a large population and that clinical applications are inherently safety-critical, addressing these gaps is crucial for the safe and responsible deployment of LLMs in real-world clinical practice.

Empowering Epidemic Response: The Role of Reinforcement Learning in Infectious Disease Control

2603.25771v1 by Mutong Liu, Yang Liu, Jiming Liu

In recent years, reinforcement learning (RL) has been used in infectious disease control to optimize intervention strategies for limiting disease spread and responding to outbreaks, owing to its adaptability to dynamic systems in many real-world scenarios and its ability to maximize long-term outcomes under different constraints. The potential of RL to assist public health sectors in preventing and controlling infectious diseases is gradually emerging, as reflected in a rapidly growing number of publications on COVID-19 and other infectious diseases. However, few surveys focus exclusively on this topic, that is, the development and application of RL approaches for optimizing non-pharmaceutical and pharmaceutical public health interventions. Therefore, this paper aims to provide a concise review and discussion of the latest literature on how RL approaches have been used to assist in controlling the spread and outbreaks of infectious diseases, covering several critical topics addressing public health demands: resource allocation, balancing between lives and livelihoods, mixed policy of multiple interventions, and inter-regional coordinated control. Finally, we conclude the paper with a discussion of several potential directions for future research.

Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models

2603.25155v1 by Chengyu Fang, Heng Guo, Zheng Jiang, Chunming He, Xiu Li, Minfeng Xu

Multimodal large language models are promising for clinical visual question answering tasks, but scaling to 3D imaging is hindered by high computational costs. Prior methods often rely on 2D slices or fixed-length token compression, disrupting volumetric continuity and obscuring subtle findings. We present Photon, a framework that represents 3D medical volumes with token sequences of variable length. Photon introduces instruction-conditioned token scheduling and surrogate gradient propagation to adaptively reduce tokens during both training and inference, which lowers computational cost while mitigating the attention dilution caused by redundant tokens. It incorporates a custom backpropagation rule with gradient restoration to enable differentiable optimization despite discrete token drop. To stabilize token compression and ensure reliable use of visual evidence, Photon further applies regularization objectives that mitigate language-only bias and improve reliability. Experiments on diverse medical visual question answering tasks show that Photon achieves state-of-the-art accuracy while reducing resource usage and accelerating both training and inference.

Factors Influencing the Quality of AI-Generated Code: A Synthesis of Empirical Evidence

2603.25146v1 by Vehid Geruslu, Zulfiyya Aliyeva, Eray Tüzün

Context: The rapid adoption of AI-assisted code generation tools, such as large language models (LLMs), is transforming software development practices. While these tools promise significant productivity gains, concerns regarding the quality, reliability, and security of AI-generated code are increasingly reported in both academia and industry. Objective: This study aims to systematically synthesize existing empirical evidence on the factors influencing the quality of AI-generated source code and to analyze how these factors impact software quality outcomes across different evaluation contexts. Method: We conducted a systematic literature review (SLR) following established guidelines, supported by an AI-assisted workflow with human oversight. A total of 24 primary studies were selected through a structured search and screening process across major digital libraries. Data were extracted and analyzed using qualitative, pattern-based evidence synthesis. Results: The findings reveal that code quality in AI-assisted development is influenced by a combination of human factors, AI system characteristics, and human-AI interaction dynamics. Key influencing factors include prompt design, task specification, and developer expertise. The results also show variability in quality outcomes such as correctness, security, maintainability, and complexity across studies, with both improvements and risks reported. Conclusion: AI-assisted code generation represents a socio-technical shift in software engineering, where achieving high-quality outcomes depends on both technological and human factors. While promising, AI-generated code requires careful validation and integration into development workflows.

Rethinking Health Agents: From Siloed AI to Collaborative Decision Mediators

2603.24986v1 by Ray-Yuan Chung, Xuhai Xu, Ari Pollack

Large language model based health agents are increasingly used by health consumers and clinicians to interpret health information and guide health decisions. However, most AI systems in healthcare operate in siloed configurations, supporting individual users rather than the multi-stakeholder relationships central to healthcare. Such use can fragment understanding and exacerbate misalignment among patients, caregivers, and clinicians. We reframe AI not as a standalone assistant, but as a collaborator embedded within multi-party care interactions. Through a clinically validated fictional pediatric chronic kidney disease case study, we show that breakdowns in adherence stem from fragmented situational awareness and misaligned goals, and that siloed use of general-purpose AI tools does little to address these collaboration gaps. We propose a conceptual framework for designing AI collaborators that surface contextual information, reconcile mental models, and scaffold shared understanding while preserving human decision authority.

Subject-Specific Low-Field MRI Synthesis via a Neural Operator

2603.24968v1 by Ziqi Gao, Nicha Dvornek, Xiaoran Zhang, Gigi Galiana, Hemant Tagare, Todd Constable

Low-field (LF) magnetic resonance imaging (MRI) improves accessibility and reduces costs but generally has lower signal-to-noise ratios and degraded contrast compared to high-field (HF) MRI, limiting its clinical utility. Simulating LF MRI from HF MRI enables virtual evaluation of novel imaging devices and development of LF algorithms. Existing low-field simulators rely on noise injection and smoothing, which fail to capture the contrast degradation seen in LF acquisitions. To this end, we introduce an end-to-end LF-MRI synthesis framework that learns HF-to-LF image degradation directly from a small number of paired HF-LF MRIs. Specifically, we introduce a novel HF-to-LF coordinate-image decoupled neural operator (H2LO) to model the underlying degradation process, and tailor it to capture high-frequency noise textures and image structure. Experimental results in T1w and T2w MRI demonstrate that H2LO produces more faithful simulated low-field images than existing parameterized noise synthesis models and popular image-to-image translation models. Furthermore, it improves performance in downstream image enhancement tasks, showcasing its potential to enhance LF MRI diagnostic capabilities.
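For context, the kind of baseline simulator the abstract criticizes (smoothing plus noise injection) can be sketched in a few lines. This is a generic illustration, not the method of any cited simulator or of H2LO: the box blur, the additive Gaussian noise model, and the names `box_blur` and `simulate_low_field` are all assumptions.

```python
import random


def box_blur(image, radius=1):
    """Average each pixel with its neighbours inside a square window."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [image[j][i]
                    for j in range(max(0, y - radius), min(h, y + radius + 1))
                    for i in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = sum(vals) / len(vals)
    return out


def simulate_low_field(image, noise_sigma=0.05, radius=1, seed=0):
    """Naive LF simulation: blur the HF image, then add Gaussian noise."""
    rng = random.Random(seed)  # seeded for reproducibility
    blurred = box_blur(image, radius)
    return [[v + rng.gauss(0.0, noise_sigma) for v in row] for row in blurred]


# A tiny "image" with two tissue intensities separated by an edge.
hf = [[0.2, 0.2, 0.8, 0.8] for _ in range(4)]
lf = simulate_low_field(hf)
print([round(v, 3) for v in lf[0]])
```

Blurring and noise leave the mean intensity of each large uniform region essentially unchanged, so the relative contrast between tissues is preserved. That is the limitation the abstract points to: contrast degradation at low field is not reproduced by this kind of pipeline.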

Sovereign AI at the Front Door of Care: A Physically Unidirectional Architecture for Secure Clinical Intelligence

2603.24898v1 by Vasu Srinivasan, Dhriti Vasu

We present a Sovereign AI architecture for clinical triage in which all inference is performed on-device and inbound data is delivered via a physically unidirectional channel, implemented using receive-only broadcast infrastructure or certified hardware data diodes, with no return path to any external network. This design removes the network-mediated attack surface by construction, rather than attempting to secure it through software controls. The system performs conversational symptom intake, integrates device-captured vitals, and produces structured, triage-aligned clinical records at the point of care. We formalize the security properties of receiver-side unidirectionality and show that the architecture is transport-agnostic across broadcast and diode-enforced deployments. We further analyze threat models, enforcement mechanisms, and deployment configurations, demonstrating how physical one-way data flow enables high-assurance operation in both resource-constrained and high-risk environments. This work positions physically unidirectional channels as a foundational primitive for sovereign, on-device clinical intelligence at the front door of care.

摘要:我們提出了一種主權人工智慧架構,用於臨床分診,其中所有推理都在設備上進行,進入數據通過物理單向通道傳送,該通道使用僅接收的廣播基礎設施或經認證的硬體數據二極體實現,並且沒有返回路徑通往任何外部網絡。這一設計通過構建消除了網絡介導的攻擊面,而不是試圖通過軟體控制來保護它。系統執行對話式症狀收集,整合設備捕獲的生命體徵,並在護理現場產生結構化、與分診對齊的臨床記錄。我們正式化了接收端單向性的安全性質,並顯示該架構在廣播和二極體強制部署中是傳輸無關的。我們進一步分析了威脅模型、執行機制和部署配置,展示了物理單向數據流如何在資源受限和高風險環境中實現高保證操作。這項工作將物理單向通道定位為主權、設備內臨床智能的基礎原語,位於護理的前門。

More Than "Means to an End": Supporting Reasoning with Transparently Designed AI Data Science Processes

2603.24877v1 by Venkatesh Sivaraman, Patrick Vossler, Adam Perer, Julian Hong, Jean Feng

Generative artificial intelligence (AI) tools can now help people perform complex data science tasks regardless of their expertise. While these tools have great potential to help more people work with data, their end-to-end approach does not support users in evaluating alternative approaches and reformulating problems, both critical to solving open-ended tasks in high-stakes domains. In this paper, we reflect on two AI data science systems designed for the medical setting and how they function as tools for thought. We find that success in these systems was driven by constructing AI workflows around intentionally-designed intermediate artifacts, such as readable query languages, concept definitions, or input-output examples. Despite opaqueness in other parts of the AI process, these intermediates helped users reason about important analytical choices, refine their initial questions, and contribute their unique knowledge. We invite the HCI community to consider when and how intermediate artifacts should be designed to promote effective data science thinking.

摘要:生成式人工智慧(AI)工具現在可以幫助人們執行複雜的數據科學任務,而不論他們的專業知識如何。雖然這些工具具有幫助更多人處理數據的巨大潛力,但它們的端到端方法並不支持用戶評估替代方法和重新定義問題,這對於在高風險領域解決開放式任務至關重要。在本文中,我們反思了為醫療環境設計的兩個AI數據科學系統,以及它們如何作為思考工具運作。我們發現,這些系統的成功是通過圍繞故意設計的中介產物構建AI工作流程來驅動的,例如可讀的查詢語言、概念定義或輸入-輸出示例。儘管AI過程的其他部分不透明,但這些中介幫助用戶推理重要的分析選擇、精煉他們的初始問題並貢獻他們獨特的知識。我們邀請HCI社群考慮何時以及如何設計中介產物,以促進有效的數據科學思維。

Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models

2603.24844v1 by Isha Puri, Mehul Damani, Idan Shenfeld, Marzyeh Ghassemi, Jacob Andreas, Yoon Kim

Given a question, a language model (LM) implicitly encodes a distribution over possible answers. In practice, post-training procedures for LMs often collapse this distribution onto a single dominant mode. While this is generally not a problem for benchmark-style evaluations that assume one correct answer, many real-world tasks inherently involve multiple valid answers or irreducible uncertainty. Examples include medical diagnosis, ambiguous question answering, and settings with incomplete information. In these cases, we would like LMs to generate multiple plausible hypotheses, ideally with confidence estimates for each one, and without computationally intensive repeated sampling to generate non-modal answers. This paper describes a multi-answer reinforcement learning approach for training LMs to perform distributional reasoning over multiple answers during inference. We modify the RL objective to enable models to explicitly generate multiple candidate answers in a single forward pass, internalizing aspects of inference-time search into the model's generative process. Across question-answering, medical diagnostic, and coding benchmarks, we observe improved diversity, coverage, and set-level calibration scores compared to single answer trained baselines. Models trained with our approach require fewer tokens to generate multiple answers than competing approaches. On coding tasks, they are also substantially more accurate. These results position multi-answer RL as a principled and compute-efficient alternative to inference-time scaling procedures such as best-of-k. Code and more information can be found at https://multi-answer-rl.github.io/.

摘要:給定一個問題,語言模型(LM)隱式地編碼了一個可能答案的分佈。在實踐中,LM 的後訓練程序通常會將這個分佈壓縮到一個主要模式上。雖然這對於假設有一個正確答案的基準風格評估來說通常不是問題,但許多現實世界的任務本質上涉及多個有效答案或不可簡化的不確定性。例子包括醫療診斷、模糊問題回答和信息不完整的情境。在這些情況下,我們希望 LM 能夠生成多個合理的假設,理想情況下為每個假設提供信心估計,而不需要計算密集的重複抽樣來生成非模式答案。本文描述了一種多答案強化學習方法,用於訓練 LM 在推理過程中進行多個答案的分佈推理。我們修改了強化學習目標,使模型能夠在單次前向傳播中明確生成多個候選答案,將推理時搜索的各個方面內化到模型的生成過程中。在問題回答、醫療診斷和編碼基準測試中,我們觀察到與單一答案訓練基準相比,模型的多樣性、覆蓋率和集合級別校準得分有所改善。使用我們的方法訓練的模型生成多個答案所需的標記數量少於競爭性方法。在編碼任務中,這些模型的準確性也顯著更高。這些結果使多答案強化學習成為推理時擴展程序(如最佳的 k)的一種原則性和計算高效的替代方案。代碼和更多信息可以在 https://multi-answer-rl.github.io/ 找到。

A Practical Guide Towards Interpreting Time-Series Deep Clinical Predictive Models: A Reproducibility Study

2603.24828v1 by Yongda Fan, John Wu, Andrea Fitzpatrick, Naveen Baskaran, Jimeng Sun, Adam Cross

Clinical decisions are high-stakes and require explicit justification, making model interpretability essential for auditing deep clinical models prior to deployment. As the ecosystem of model architectures and explainability methods expands, critical questions remain: Do architectural features like attention improve explainability? Do interpretability approaches generalize across clinical tasks? While prior benchmarking efforts exist, they often lack extensibility and reproducibility, and critically, fail to systematically examine how interpretability varies across the interplay of clinical tasks and model architectures. To address these gaps, we present a comprehensive benchmark evaluating interpretability methods across diverse clinical prediction tasks and model architectures. Our analysis reveals that: (1) attention when leveraged properly is a highly efficient approach for faithfully interpreting model predictions; (2) black-box interpreters like KernelSHAP and LIME are computationally infeasible for time-series clinical prediction tasks; and (3) several interpretability approaches are too unreliable to be trustworthy. From our findings, we discuss several guidelines on improving interpretability within clinical predictive pipelines. To support reproducibility and extensibility, we provide our implementations via PyHealth, a well-documented open-source framework: https://github.com/sunlabuiuc/PyHealth.

摘要:臨床決策是高風險的,並需要明確的理由,這使得模型可解釋性在部署之前對深度臨床模型的審核變得至關重要。隨著模型架構和可解釋性方法的生態系統不斷擴展,仍然存在一些關鍵問題:像注意力這樣的架構特徵是否能提高可解釋性?可解釋性方法是否能在不同的臨床任務中通用?雖然之前的基準測試工作已經存在,但它們往往缺乏可擴展性和可重現性,並且關鍵的是,未能系統性地檢查可解釋性在臨床任務和模型架構之間的相互作用中是如何變化的。為了填補這些空白,我們提出了一個全面的基準,評估不同臨床預測任務和模型架構下的可解釋性方法。我們的分析揭示了: (1) 當適當利用時,注意力是一種非常有效的方法,可以忠實地解釋模型預測; (2) 像KernelSHAP和LIME這樣的黑箱解釋器在時間序列臨床預測任務中計算上不可行; (3) 幾種可解釋性方法過於不可靠,無法被信任。根據我們的發現,我們討論了幾條改善臨床預測管道中可解釋性的指導方針。為了支持可重現性和可擴展性,我們通過PyHealth提供我們的實現,這是一個文檔完善的開源框架: https://github.com/sunlabuiuc/PyHealth。

HASS: Hierarchical Simulation of Logopenic Aphasic Speech for Scalable PPA Detection

2603.26795v1 by Harrison Li, Kevin Wang, Cheol Jun Cho, Jiachen Lian, Rabab Rangwala, Chenxu Guo, Emma Yang, Lynn Kurteff, Zoe Ezzes, Willa Keegan-Rodewald, Jet Vonk, Siddarth Ramkrishnan, Giada Antonicelli, Zachary Miller, Marilu Gorno Tempini, Gopala Anumanchipalli

Building a diagnosis model for primary progressive aphasia (PPA) has been challenging due to data scarcity. Collecting clinical data at scale is limited by the high vulnerability of the clinical population and the high cost of expert labeling. To circumvent this, previous studies simulate dysfluent speech to generate training data. However, those approaches are not comprehensive enough to simulate PPA as a holistic, multi-level phenotype, instead relying on isolated dysfluencies. To address this, we propose a novel, clinically grounded simulation framework, Hierarchical Aphasic Speech Simulation (HASS). HASS aims to simulate behaviors of the logopenic variant of PPA (lvPPA) with varying degrees of severity. To this end, the semantic, phonological, and temporal deficits of lvPPA are systematically identified by clinical experts and then simulated. We demonstrate that our framework enables more accurate and generalizable detection models.

摘要:建立初級進行性失語症(PPA)的診斷模型一直面臨數據稀缺的挑戰。大規模收集臨床數據受到臨床人群高脆弱性和專家標註高成本的限制。為了繞過這一問題,以前的研究模擬流暢性障礙的語言來生成訓練數據。然而,這些方法並不足以全面模擬PPA作為整體的、多層次的表型,而是依賴於孤立的流暢性障礙。為了解決這一問題,我們提出了一個新穎的、以臨床為基礎的模擬框架,稱為層級失語症語言模擬(HASS)。HASS旨在模擬PPA的語言變異型(lvPPA)在不同嚴重程度下的行為。為此,lvPPA的語義、語音和時間缺陷由臨床專家系統性地識別並進行模擬。我們證明了我們的框架能夠實現更準確且可泛化的檢測模型。

PhyDCM: A Reproducible Open-Source Framework for AI-Assisted Brain Tumor Classification from Multi-Sequence MRI

2603.26794v1 by Hayder Saad Abdulbaqi, Mohammed Hadi Rahim, Mohammed Hassan Hadi, Haider Ali Aboud, Ali Hussein Allawi

MRI-based medical imaging has become indispensable in modern clinical diagnosis, particularly for brain tumor detection. However, the rapid growth in data volume poses challenges for conventional diagnostic approaches. Although deep learning has shown strong performance in automated classification, many existing solutions are confined to closed technical architectures, limiting reproducibility and further academic development. PhyDCM is introduced as an open-source software framework that integrates a hybrid classification architecture based on MedViT with standardized DICOM processing and an interactive desktop visualization interface. The system is designed as a modular digital library that separates computational logic from the graphical interface, allowing independent modification and extension of components. Standardized preprocessing, including intensity rescaling and limited data augmentation, ensures consistency across varying MRI acquisition settings. Experimental evaluation on MRI datasets from BRISC2025 and curated Kaggle collections (FigShare, SARTAJ, and Br35H) demonstrates stable diagnostic performance, achieving over 93% classification accuracy across categories. The framework supports structured, exportable outputs and multi-planar reconstruction of volumetric data. By emphasizing transparency, modularity, and accessibility, PhyDCM provides a practical foundation for reproducible AI-driven medical image analysis, with flexibility for future integration of additional imaging modalities.

摘要:MRI 基於醫學影像在現代臨床診斷中已變得不可或缺,特別是在腦腫瘤檢測方面。然而,數據量的快速增長對傳統診斷方法提出了挑戰。儘管深度學習在自動分類中顯示出強大的性能,但許多現有解決方案受限於封閉的技術架構,限制了可重複性和進一步的學術發展。PhyDCM 被引入作為一個開源軟體框架,整合了基於 MedViT 的混合分類架構、標準化的 DICOM 處理和互動式桌面可視化界面。該系統被設計為一個模組化的數位圖書館,將計算邏輯與圖形界面分離,允許獨立修改和擴展組件。標準化的預處理,包括強度重新縮放和有限的數據增強,確保在不同的 MRI 獲取設置中保持一致性。對來自 BRISC2025 和策劃的 Kaggle 收藏(FigShare、SARTAJ 和 Br35H)的 MRI 數據集的實驗評估顯示出穩定的診斷性能,在各類別中達到超過 93% 的分類準確率。該框架支持結構化、可導出的輸出和體積數據的多平面重建。通過強調透明性、模組化和可及性,PhyDCM 為可重複的 AI 驅動醫學影像分析提供了實用的基礎,並具備未來整合其他影像模式的靈活性。
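
The PhyDCM abstract above mentions intensity rescaling as a preprocessing step but does not say which scheme is used. As a minimal sketch, assuming plain min-max rescaling to a fixed range (the function name `rescale_intensity` and its parameters are hypothetical, not taken from PhyDCM), the idea could look like this:

```python
def rescale_intensity(pixels, out_min=0.0, out_max=1.0):
    """Min-max rescale a flat list of pixel intensities to [out_min, out_max].

    Assumption: simple min-max scaling; the framework may use another scheme.
    """
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # Constant image: avoid division by zero and map everything to out_min.
        return [out_min for _ in pixels]
    span = out_max - out_min
    return [out_min + (p - lo) / (hi - lo) * span for p in pixels]


if __name__ == "__main__":
    # Two scans with different raw ranges end up on the same [0, 1] scale.
    print(rescale_intensity([0, 50, 100]))
    print(rescale_intensity([1000, 2000, 3000]))
```

Mapping every scan to the same range is one simple way to reduce differences between acquisition settings before classification.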

Dissecting Model Failures in Abdominal Aortic Aneurysm Segmentation through Explainability-Driven Analysis

2603.24801v1 by Abu Noman Md Sakib, Merjulah Roby, Zijie Zhang, Satish Muluk, Mark K. Eskandari, Ender A. Finol

Computed tomography image segmentation of complex abdominal aortic aneurysms (AAA) often fails because the models assign internal focus to irrelevant structures or do not focus on thin, low-contrast targets. Where the model looks is the primary training signal, and thus we propose an Explainable AI (XAI) guided encoder shaping framework. Our method computes a dense, attribution-based encoder focus map ("XAI field") from the final encoder block and uses it in two complementary ways: (i) we align the predicted probability mass to the XAI field to promote agreement between focus and output; and (ii) we route the field into a lightweight refinement pathway and a confidence prior that modulates logits at inference, suppressing distractors while preserving subtle structures. The objective terms serve only as control signals; the contribution is the integration of attribution guidance into representation and decoding. We evaluate clinically validated challenging cases curated for failure-prone scenarios. Compared to a base SAM setup, our implementation yields substantial improvements. The observed gains suggest that explicitly optimizing encoder focus via XAI guidance is a practical and effective principle for reliable segmentation in complex scenarios.

摘要:計算機斷層掃描影像分割複雜的腹主動脈瘤(AAA)常常失敗,因為模型將內部焦點分配給不相關的結構,或未能專注於薄且低對比度的目標。模型的觀察位置是主要的訓練信號,因此我們提出了一種可解釋的人工智慧(XAI)引導編碼器塑形框架。我們的方法從最終的編碼器區塊計算出一個密集的、基於歸因的編碼器焦點圖(“XAI場”),並以兩種互補的方式使用它:(i)我們將預測的概率質量與XAI場對齊,以促進焦點與輸出之間的一致性;以及(ii)我們將該場路由到一個輕量級的精煉路徑和一個信心先驗,在推理時調節邏輯,壓制干擾物,同時保留微妙的結構。目標項僅作為控制信號;其貢獻在於將歸因引導整合到表示和解碼中。我們評估了經臨床驗證的挑戰性案例,這些案例是為容易失敗的情境精心策劃的。與基礎的SAM設置相比,我們的實現產生了顯著的改進。觀察到的增益表明,通過XAI引導明確優化編碼器焦點是一個實用且有效的原則,能夠在複雜情境中實現可靠的分割。
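
The abstract above describes aligning the predicted probability mass to the XAI field, without giving the exact objective. One way such an alignment term could be written, purely as an illustrative assumption, is a KL divergence between the two maps after each is normalized to sum to 1. The names `normalize` and `kl_alignment` are hypothetical and this is not the authors' loss:

```python
import math


def normalize(values, eps=1e-8):
    """Turn non-negative scores into a distribution that sums to 1."""
    clipped = [max(v, 0.0) + eps for v in values]
    total = sum(clipped)
    return [v / total for v in clipped]


def kl_alignment(pred_probs, xai_field):
    """KL(xai || pred) over flattened pixels (assumed form of the alignment term).

    The value is 0 when the normalized maps agree and grows as the
    prediction puts mass where the attribution map does not.
    """
    p = normalize(xai_field)
    q = normalize(pred_probs)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    focus = [0.1, 0.9, 0.0, 0.0]          # toy attribution map (4 pixels)
    agree = [0.1, 0.9, 0.0, 0.0]          # prediction that matches the focus
    distracted = [0.0, 0.1, 0.9, 0.0]     # prediction drawn to a distractor
    print(kl_alignment(agree, focus))
    print(kl_alignment(distracted, focus))
```

In a real training setup this term would be computed on tensors and added to the segmentation loss with a weight; the toy version only shows the direction of the penalty.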

Evaluating Fine-Tuned LLM Model For Medical Transcription With Small Low-Resource Languages Validated Dataset

2603.24772v1 by Mohammed Nowshad Ruhani Chowdhury, Mohammed Nowaz Rabbani Chowdhury, Sakari Lukkarinen

Clinical documentation is a critical factor for patient safety, diagnosis, and continuity of care. The administrative burden of EHRs is a significant factor in physician burnout. This is a critical issue for low-resource languages, including Finnish. This study investigates the effectiveness of a domain-aligned natural language processing (NLP) large language model for medical transcription in Finnish by fine-tuning LLaMA 3.1-8B on a small validated corpus of simulated clinical conversations by students at Metropolia University of Applied Sciences. The fine-tuning process for medical transcription used a controlled preprocessing and optimization approach. Fine-tuning effectiveness was evaluated by sevenfold cross-validation. The evaluation metrics for the fine-tuned LLaMA 3.1-8B were BLEU = 0.1214, ROUGE-L = 0.4982, and BERTScore F1 = 0.8230. The results showed low n-gram overlap but strong semantic similarity with reference transcripts. This study indicates that fine-tuning can be an effective approach for translation of medical discourse in spoken Finnish and supports the feasibility of fine-tuning a privacy-oriented, domain-specific large language model for clinical documentation in Finnish. It also provides directions for future work.

摘要:臨床文檔是患者安全、診斷和持續護理的關鍵因素。電子健康紀錄(EHR)的行政負擔是醫生倦怠的重要因素。這對於資源有限的語言來說是一個關鍵問題,包括芬蘭語。本研究旨在通過在Metropolia應用科學大學的學生模擬臨床對話的小型驗證語料庫上微調LLaMA 3.1-8B,調查對芬蘭語醫療轉錄的領域對齊自然語言處理(NLP)大型語言模型的有效性。醫療轉錄的微調過程使用了受控的預處理和優化方法。微調的有效性通過七折交叉驗證進行評估。微調後的LLaMA 3.1-8B的評估指標為BLEU = 0.1214,ROUGE-L = 0.4982,和BERTScore F1 = 0.8230。結果顯示n-gram重疊較低,但與參考轉錄之間有強烈的語義相似性。本研究表明,微調可以是一種有效的芬蘭語口語醫療話語翻譯方法,並支持微調針對隱私的領域特定大型語言模型在芬蘭語臨床文檔中的可行性。此外,還提供了未來工作的方向。
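
The abstract above reports ROUGE-L under sevenfold cross-validation. As a rough sketch of what those two pieces involve, the code below computes a ROUGE-L F1 score from the longest common subsequence of whitespace tokens and splits item indices into seven folds. Whitespace tokenization, the balanced F1 weighting, and the contiguous unshuffled folds are assumptions made here; the study's actual tooling is not stated.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 on whitespace tokens (assumed tokenization and weighting)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def k_fold_indices(n_items, k=7):
    """Yield (train, test) index lists for k roughly equal contiguous folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size


if __name__ == "__main__":
    print(rouge_l_f1("a b c d", "a c d e"))
    for train, test in k_fold_indices(15, k=7):
        print(len(train), test)
```

Because ROUGE-L rewards in-order token matches rather than exact n-grams, it can stay moderate even when BLEU is low, which is consistent with the pattern the abstract reports.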

Pseudo Label NCF for Sparse OHC Recommendation: Dual Representation Learning and the Separability Accuracy Trade off

2603.24750v2 by Pronob Kumar Barman, Tera L. Reynolds, James Foulds

Online Health Communities connect patients for peer support, but users face a discovery challenge when they have minimal prior interactions to guide personalization. We study recommendation under extreme interaction sparsity in a survey-driven setting where each user provides a 16-dimensional intake vector and each support group has a structured feature profile. We extend Neural Collaborative Filtering architectures, including Matrix Factorization, Multi-Layer Perceptron, and NeuMF, with an auxiliary pseudo-label objective derived from survey-group feature alignment using cosine similarity mapped to [0, 1]. The resulting Pseudo Label NCF learns dual embedding spaces: main embeddings for ranking and pseudo-label embeddings for semantic alignment. We evaluate on a dataset of 165 users and 498 support groups using a leave-one-out protocol that reflects cold-start conditions. All pseudo-label variants improve ranking performance: MLP improves HR@5 from 2.65% to 5.30%, NeuMF from 4.46% to 5.18%, and MF from 4.58% to 5.42%. Pseudo-label embedding spaces also show higher cosine silhouette scores than baseline embeddings, with MF improving from 0.0394 to 0.0684 and NeuMF from 0.0263 to 0.0653. We further observe a negative correlation between embedding separability and ranking accuracy, indicating a trade-off between interpretability and performance. These results show that survey-derived pseudo labels improve recommendation under extreme sparsity while producing interpretable, task-specific embedding spaces.

摘要:線上健康社群連結病患以獲得同儕支持,但使用者在面對最少的先前互動以指導個性化時,會遇到發現的挑戰。我們研究在極端互動稀疏的情況下的推薦,這是一個由調查驅動的環境,每位使用者提供一個16維的輸入向量,每個支持小組都有一個結構化的特徵檔案。我們擴展了神經協作過濾架構,包括矩陣分解、多層感知器和NeuMF,並使用從調查小組特徵對齊中衍生的輔助偽標籤目標,該目標利用餘弦相似度映射到[0, 1]。最終的偽標籤NCF學習了雙重嵌入空間:主要嵌入用於排名,偽標籤嵌入用於語義對齊。我們在一個包含165位使用者和498個支持小組的數據集上進行評估,使用一種反映冷啟動條件的留一法協議。所有偽標籤變體均提高了排名性能:MLP將HR@5從2.65%提高到5.30%,NeuMF從4.46%提高到5.18%,MF從4.58%提高到5.42%。偽標籤嵌入空間的餘弦輪廓分數也顯示出比基線嵌入更高的分數,MF從0.0394提高到0.0684,NeuMF從0.0263提高到0.0653。我們進一步觀察到嵌入可分性與排名準確性之間存在負相關,這表明可解釋性與性能之間存在權衡。這些結果顯示,從調查衍生的偽標籤在極端稀疏情況下改善了推薦,同時產生可解釋的任務特定嵌入空間。
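
Two quantities in the abstract above lend themselves to a short worked example: the pseudo label (cosine similarity between a user's intake vector and a group's feature profile, mapped to [0, 1]) and HR@5 under leave-one-out evaluation. The sketch below assumes the affine mapping (cos + 1) / 2, which the abstract does not confirm, and uses hypothetical function names:

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pseudo_label(user_vec, group_vec):
    """Map cosine similarity from [-1, 1] to [0, 1] (assumed affine mapping)."""
    return (cosine_similarity(user_vec, group_vec) + 1.0) / 2.0


def hit_rate_at_k(ranked_lists, held_out_items, k=5):
    """Fraction of users whose held-out item appears in their top-k ranking."""
    hits = sum(
        1 for ranked, item in zip(ranked_lists, held_out_items) if item in ranked[:k]
    )
    return hits / len(held_out_items)


if __name__ == "__main__":
    # Toy 3-dimensional vectors stand in for the 16-dimensional survey vectors.
    print(pseudo_label([1.0, 0.0, 1.0], [1.0, 0.0, 1.0]))
    print(pseudo_label([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))
    # Two users, one held-out group each; only the first is ranked in the top 2.
    print(hit_rate_at_k([[3, 7, 9], [4, 5, 6]], [7, 8], k=2))
```

With 165 users and one held-out group per user, each additional hit moves HR@5 by about 0.6 percentage points, which helps put the reported gains in context.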