Medical explainable AI

Publish Date	Title	Authors	arXiv ID	Code
2026-06-17	Explaining Attention with Program Synthesis	Amiri Hayes et.al.	2606.19317v1	null
2026-06-17	A Taxonomy of Mental Health and Technology Needs for Alzheimer's and Dementia Caregivers	Keran Wang et.al.	2606.19247v1	null
2026-06-17	The More the Merrier: Combining Properties for ABox Abduction under Repair Semantics for ELbot	Anselm Haak et.al.	2606.19197v1	null
2026-06-17	A Clinician-Centered Pipeline for Annotation and Evaluation in Ultrasound AI Studies	Fangyijie Wang et.al.	2606.19174v1	null
2026-06-17	Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction	Jingyi Zhou et.al.	2606.19144v1	null
2026-06-17	Analysing drivers and interdependencies in European electricity markets using XAI	Antoine Pesenti et.al.	2606.19118v1	null
2026-06-17	APT: Atomic Physical Transitions for Causal Video-Language Understanding	Shang Wu et.al.	2606.18586v1	null
2026-06-17	DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models	Patrick Cooper et.al.	2606.18557v1	null
2026-06-16	PSyGenTAB: A Privacy-Preserving Framework for Synthetic Clinical Tabular Data Generation via Constrained Optimization	Arshia Ilaty et.al.	2606.18518v1	null
2026-06-16	From Specification to Execution: AI Assisted Scientific Workflow Management	Komal Thareja et.al.	2606.18425v1	null
2026-06-16	RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills	Weizhi Zhang et.al.	2606.18203v1	null
2026-06-16	Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour	Abeer Badawi et.al.	2606.18129v1	null
2026-06-16	Agentic AI-based Framework for Mitigating Premature Diagnostic Handoff and Silent Hallucination in Healthcare Applications	Divyansh Srivastava et.al.	2606.18068v1	null
2026-06-16	When LLMs Analyze Scars: From Images to Clinically-Meaningful Features	Ruman Wang et.al.	2606.18063v1	null
2026-06-16	Catastrophic Forgetting is Low-Rank: A Function-Space Theory for Continual Adaptation	Ido Nitzan Hidekel et.al.	2606.18024v1	null
2026-06-16	LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling	Jian Yang et.al.	2606.18023v1	null
2026-06-16	A Quantitative Analysis of Multimodal Biomarkers in Alzheimer's Disease	Antonio Scardace et.al.	2606.17867v1	null
2026-06-16	Conservation Laws for Modern Neural Architectures	Viet-Hoang Tran et.al.	2606.17816v1	null
2026-06-16	EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent	Zeyao Du et.al.	2606.17698v1	null
2026-06-16	From Brewing to Resolution: Tracing the Internal Lifecycle of Code Reasoning in LLMs	Siyue Chen et.al.	2606.17648v1	null
2026-06-16	SketchXplain: Intuitive Visual Explanations of Image Classifiers with Sketches	Wencan Zhang et.al.	2606.17646v1	null
2026-06-16	Offline Preference-Based Trajectory Evaluation	Fernando Diaz et.al.	2606.17541v1	null
2026-06-16	Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing	Kexin Chen et.al.	2606.17478v1	null
2026-06-16	Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation	Xinyu Qin et.al.	2606.17405v1	null
2026-06-15	SpeechDx: A Multi-Task Benchmark for Clinical Speech AI	Sejal Bhalla et.al.	2606.17339v1	null
2026-06-15	Phantoms and Disclosures: a Causal Framework for Auditing Synthetic Data	Kareem Amin et.al.	2606.16952v1	null
2026-06-15	Demystifying Variance in Circuit Discovery of LLMs	Frank Zhengqing Wu et.al.	2606.16920v1	null
2026-06-15	Symbolic Informalization: Fluent, Productive, Multilingual	Aarne Ranta et.al.	2606.16893v1	null
2026-06-15	Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering	Sanjay Basu et.al.	2606.16890v1	null
2026-06-15	Medical world models: representing medical states, modelling clinical dynamics and guiding intervention policies	Ke Liu et.al.	2606.16721v1	null
2026-06-15	Is Your Trajectory Displacement Safe in Long-tail?	Qiao Sun et.al.	2606.16313v1	null
2026-06-15	PACT: Privileged Trace Co-Training for Multi-Turn Tool-Use Agents	Zhenbang Du et.al.	2606.16215v1	null
2026-06-15	Embedded Arena: Iterative Optimization via Hardware Feedback	Zhihan Zhang et.al.	2606.16190v1	null
2026-06-15	LiteOdyssey: A Lightweight Reasoning AI Agent for Interpretable Rare-Disease Diagnosis	Minh-Ha Nguyen et.al.	2606.16149v1	null
2026-06-15	XAI-Grounded Explanation Generation for Speech Deepfake Detection with Training-Free Multimodal Large Language Models	Yupei Li et.al.	2606.16137v1	null
2026-06-14	SciText2Eq: Assessing LLMs for Explainable Equation Generation for Scientific Creativity	Yifan Mo et.al.	2606.16003v1	null
2026-06-14	Entity Labels Are Not Entity Signals: A Framework for Observable Relevance in Document Re-Ranking	Utshab Kumar Ghosh et.al.	2606.15998v1	null
2026-06-14	DeepRoot: A KG-Coordinated Multi-Agent System for Therapeutic Reasoning over Historical Medical Texts	Zijian Carl Ma et.al.	2606.15931v1	null
2026-06-14	DifFRACT: Diffusion Feature Reconstruction and Attribution for Circuit Tracing	Artyom Mazur et.al.	2606.15796v1	null
2026-06-14	Visualizing Uncertainty: Spatial Maps of Missing and Conflicting Evidence in Deep Learning	Dong Hyun Jeong et.al.	2606.15767v1	null
2026-06-14	InstantForget: Update-Free Backdoor Unlearning with Inference-Time Feature Reset	Zhenyu Yu et.al.	2606.15730v1	null
2026-06-14	AI-Driven Framework for Adaptive Water Network Management with Proof-of-Concept Implementation: Addressing Non-Revenue Water in Jordan	Mohammed Fasha et.al.	2606.15709v1	null
2026-06-14	Quantum Cinema: An Interactive Cinematic Exploration of Quantum Computing Hardware via Generative World Models	Aoyu Zhang et.al.	2606.17102v1	null
2026-06-14	Is Code Better Than Language for Algorithmic Reasoning	Terry Tong et.al.	2606.15589v1	null
2026-06-14	Service-Induced Congestion in Memory-Constrained LLM Serving	Ruicheng Ao et.al.	2606.15555v1	null
2026-06-13	Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering	Zaifu Zhan et.al.	2606.15419v1	null
2026-06-13	APEX: Adaptive Principle EXtraction A Three-Layer Self-Evolution Framework for Production AI Agents	Ya-Chuan Chen et.al.	2606.15363v1	null
2026-06-13	Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes	Mohamed Bayan Kmainasi et.al.	2606.15307v1	null
2026-06-13	Enabling Real-Time Point-of-Care Ultrasound Segmentation: A GPU-Free Deployment in Resource-Limited Settings	Weihao Gao et.al.	2606.15176v1	null
2026-06-12	CottonLeafVision: An Explainable and Robust Deep Learning Framework for Cotton Leaf Disease Classification	Rafi Ahamed et.al.	2606.14686v1	null
2026-06-12	A Definition of Good Explanations and the Challenges Explaining LLM Outputs	Louis Mahon et.al.	2606.14838v1	null
2026-06-12	Listening with Attention: Entropy-Guided Explainability for Transformer-Based Audio Models	Ravi Ranjan et.al.	2606.14647v1	null
2026-06-12	Fodor and Pylyshyn's Systematicity Challenge Still Stands	Michael Goodale et.al.	2606.14512v1	null
2026-06-12	Learning Urban Access Costs from Origin-Destination Flows via Inverse Optimal Transport	Paula Joy B. Martinez et.al.	2606.14157v1	null
2026-06-12	Recovering Stranded Discrimination in Knowledge Tracing: Per-Item Bias Correction via Empirical-Bayes Shrinkage	Xiaoran Yan et.al.	2606.14123v1	null
2026-06-11	How do Self-Supervised Remote Sensing Vision Models Transfer to Downstream Tasks?	Julia Romero et.al.	2606.13896v1	null
2026-06-11	Explaining RhythmFormer: A Systematic XAI Analysis of Periodic Sparse Attention for Remote Photoplethysmography	Louis Chen et.al.	2606.13839v1	null
2026-06-11	ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages	Tanmoy Kanti Halder et.al.	2606.13572v1	null
2026-06-11	Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation	Aruna Dey et.al.	2606.13556v2	null
2026-06-11	Physics-Guided Spatiotemporal Learning for Coastal Wave Peak Period Estimation from Video	Abubakar Hamisu Kamagata et.al.	2606.13302v1	null
2026-06-11	Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints	Omar Alshahrani et.al.	2606.13211v1	null
2026-06-11	Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation	Elena S. Kozachok et.al.	2606.13135v1	null
2026-06-11	Zero-source LLM Hallucination Detection with Human-like Criteria Probing	Jiahao Yang et.al.	2606.12900v1	null
2026-06-11	PolicyGuard: Towards Test-time and Step-level Adversary Defense for Reinforcement Learning Agent	Junfeng Guo Heng Huang et.al.	2606.12896v1	null
2026-06-11	Acquisition state behaves as a structured, measurable variable governing lung-nodule AI: kernel-driven measurement instability and noise-driven detection fragility, invisible to DICOM metadata	Daniel Soliman et.al.	2606.12824v1	null
2026-06-10	LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data	Yifan Gao et.al.	2606.12699v1	null
2026-06-10	Atlas H&E-TME: Scalable AI-Based Tissue Profiling at Expert Pathologist-Level Accuracy	Kai Standvoss et.al.	2606.12346v1	null
2026-06-10	Using Explainability as a Training-Time Reliability Signal for Efficient ECG Classification	Veerendhra Kumar Dangeti et.al.	2606.12252v1	null
2026-06-10	Reinforcement Learning Disrupts Gradient-Based Adversarial Optimization	Xinhai Zou et.al.	2606.12251v1	null
2026-06-10	Towards Responsibly Non-Compliant Machines	Marija Slavkovik et.al.	2606.12147v1	null
2026-06-10	Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders	Gleb Gerasimov et.al.	2606.12138v1	null
2026-06-10	Tabular Foundation Models for Clinical Survival Analysis via Survival-Aware Adaptation	Minh-Khoi Pham et.al.	2606.12006v1	null
2026-06-10	Frozen Multimodal Embeddings for AI-Assisted Interview Assessment of Personality and Cognitive Ability	Kuo-En Hung et.al.	2606.11930v2	null
2026-06-10	Beyond representational alignment with brain-guided language models for robust reasoning	Mingqing Xiao et.al.	2606.11893v1	null
2026-06-10	Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task	Qianyu Yao et.al.	2606.11830v1	null
2026-06-10	Multimodal Ordinal Modeling of Alzheimer's Disease Severity Using Structural MRI and Clinical Data	Boris-Stephan Rauchmann et.al.	2606.11794v1	null
2026-06-10	Extracting Semantics: LLM-Guided Automatic Population of Robot Ontology from URDF	Bastien Dussard et.al.	2606.17073v1	null
2026-06-10	MedCTA: A Benchmark for Clinical Tool Agents	Tajamul Ashraf et.al.	2606.11702v1	null
2026-06-10	Quickest Detection of Hallucination Onset: Delay Bounds and Learned CUSUM Statistics	Igor Itkin et.al.	2606.12476v2	null
2026-06-09	Can AI Agents Synthesize Scientific Conclusions?	Hayoung Jung et.al.	2606.11337v1	null
2026-06-09	Designed by Journalists, but Is It for Readers? Rethinking AI Disclosures and Transparency in News	Pooja Prajod et.al.	2606.11116v1	null
2026-06-09	FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model	Mahmood Alzubaidi et.al.	2606.11106v1	null
2026-06-09	Superficial Beliefs in LLM Decision-Making	Gabriel Freedman et.al.	2606.11016v1	null
2026-06-09	Understanding and mitigating the risks of OpenClaw for non-technical users: A practical guide with Skill	Junchang Zheng et.al.	2606.11007v1	null
2026-06-09	Generative Explainability for Next-Generation Networks: LLM-Augmented XAI with Mutual Feature Interactions	Kiarash Rezaei et.al.	2606.10942v1	null
2026-06-09	What Do Deepfake Speech Detectors Actually Hear?	Vojtěch Staněk et.al.	2606.10912v1	null
2026-06-09	Accelerating NeurASP with vectorization and caching	Alexander Philipp Rader et.al.	2606.10787v1	null
2026-06-09	From Data Heterogeneity to Convergence: A Data-Centric Review of Federated Learning	Huong Nguyen et.al.	2606.10595v1	null
2026-06-09	Towards Critical Branching Mechanism in Recurrent Neural Networks	Feixiang Ren et.al.	2606.10384v1	null
2026-06-09	Supervised Fine-tuning with Synthetic Rationale Data Hurts Real-World Disease Prediction	Buxin Su et.al.	2606.10279v1	null
2026-06-08	Gaming AI-Assisted Peer Reviews Poses New Risks to the Scientific Community	Lin Li et.al.	2606.10159v1	null
2026-06-08	XMedFusion: A Knowledge-Guided Multimodal Perception and Reasoning Framework for Autonomous Medical Systems	Hamza Riaz et.al.	2606.14766v1	null
2026-06-08	Hybrid Robustness Verification for Spatio-Temporal Neural Networks	Sherwin Varghese et.al.	2606.09746v1	null
2026-06-08	Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery	Suraj Biswas et.al.	2606.09672v1	null
2026-06-08	Transition-Based Digital Twin Modelling for Alzheimer's Disease under Sparse Longitudinal Data	Yinyu Huang et.al.	2606.09671v1	null
2026-06-08	Self-Explainability in Self-Adaptive and Self-Organising Systems: Status and Research Directions	Tom Beyer et.al.	2606.09568v1	null
2026-06-08	Capacity, Not Format: Rethinking Structured Reasoning Failures	Hengxin Fan et.al.	2606.09410v1	null
2026-06-08	Disentangling Hallucinations: Orthogonal Semantic Projection for Robust Interpretability	Emirhan Bilgiç et.al.	2606.14758v1	null
2026-06-08	TRIAGE: Dialectical Reasoning for Explainable Risk Prediction on Irregularly Sampled Medical Time Series with LLMs	Hyeongwon Jang et.al.	2606.09030v1	null
2026-06-08	Understanding Quantization-Aware Training: Gradients at Quantized Weights Bias to the Low-Loss Basin	Hanyang Li et.al.	2606.09012v1	null

Abstracts

Explaining Attention with Program Synthesis

2606.19317v1 by Amiri Hayes, Belinda Li, Jacob Andreas

A longstanding goal of research on interpretable deep learning is to replace opaque neural computations with human-meaningful symbolic descriptions. In this paper, we propose an approach for approximating the behavior of components of deep networks with executable programs. We focus on attention heads in transformer language models. For a given head, we first compute its associated attention matrices on a collection of randomly selected training examples. Next, we prompt a pre-trained language model with a summary of these matrices, and instruct it to generate a set of Python programs that can reproduce the associated attention patterns given only text from the input sentence. Finally, we re-rank programs according to how well our final set of programs predict behavior on held-out inputs. We demonstrate that a set of fewer than 1,000 such generated programs can reproduce the attention patterns of heads in GPT-2, TinyLlama-1.1B, and Llama-3B, achieving an average Intersection-over-Union similarity above 75% on TinyStories. Moreover, the best-fit programs can replace neural attention heads without substantially affecting model behavior: replacing 25% of attention heads with programmatic surrogates across the three models incurs only a 16% average perplexity increase, while maintaining performance on a variety of downstream question answering benchmarks. This work contributes a scalable pipeline for reverse-engineering attention heads in transformer models using human-readable, executable code, advancing a path toward symbolic transparency in neural models.
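
As a rough illustration of the evaluation idea in the abstract above, the sketch below scores a hand-written program against a reference attention pattern using Intersection-over-Union. The 0.5 threshold, the `previous_token_program` helper, and the toy reference matrix are illustrative assumptions; the abstract does not specify how the authors binarize or compare attention matrices.

```python
from typing import List, Set, Tuple

Matrix = List[List[float]]


def previous_token_program(tokens: List[str]) -> Matrix:
    """Hypothetical surrogate program: each token attends to the token before it."""
    n = len(tokens)
    pattern = [[0.0] * n for _ in range(n)]
    for i in range(n):
        j = max(i - 1, 0)  # the first token has no predecessor, so it attends to itself
        pattern[i][j] = 1.0
    return pattern


def active_cells(matrix: Matrix, threshold: float = 0.5) -> Set[Tuple[int, int]]:
    """Binarize an attention matrix: keep (query, key) pairs above the threshold."""
    return {
        (i, j)
        for i, row in enumerate(matrix)
        for j, weight in enumerate(row)
        if weight > threshold
    }


def attention_iou(predicted: Matrix, reference: Matrix, threshold: float = 0.5) -> float:
    """Intersection-over-Union of the two thresholded attention patterns."""
    pred = active_cells(predicted, threshold)
    ref = active_cells(reference, threshold)
    union = pred | ref
    return len(pred & ref) / len(union) if union else 1.0


tokens = ["the", "cat", "sat", "down"]
# Toy reference attention (rows = queries, columns = keys); values are made up.
reference = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.8, 0.1, 0.0],
    [0.6, 0.1, 0.2, 0.1],  # this query attends to the first token instead
]
score = attention_iou(previous_token_program(tokens), reference)
print(f"IoU = {score:.2f}")  # prints "IoU = 0.60"
```

Three of the five active cells overlap, so the program matches this toy head with an IoU of 0.6. In the setting the abstract describes, the same kind of score would be computed on held-out inputs to re-rank candidate programs.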

A Taxonomy of Mental Health and Technology Needs for Alzheimer's and Dementia Caregivers

2606.19247v1 by Keran Wang, Drishti Goel, Jiayue Melissa Shi, Violeta J. Rodriguez, Daniel S. Brown, Dong Whi Yoo, Ravi Karkar, Koustuv Saha

Family members caring for individuals with Alzheimer's disease and related dementias (AD/ADRD) provide the foundation of long-term care worldwide. In 2023, more than 11 million U.S. family and friends contributed 18 billion hours of unpaid care, often at the cost of their own physical and mental health. These informal caregivers -- also referred to as the "invisible second patients" -- experience elevated rates of mental health problems. Yet research commonly reduces their complex psychosocial experiences to a single construct of caregiver burden, obscuring which specific needs are unmet and which are effectively supported. At the same time, digital and AI-enabled technologies are rapidly expanding, from smartphone apps and videoconferencing to sensor platforms and AI chatbots. However, the absence of shared frameworks across medicine, psychology, and technology research limits cumulative progress. This study introduces a Caregiver Mental Health and Technology Taxonomy that systematically links AD/ADRD caregiver needs with corresponding classes of technology-based interventions. Drawing from an interdisciplinary literature review and two qualitative studies with caregivers, the taxonomy identifies mismatches between caregiver priorities and existing technological support, highlights under-served domains such as relational strain and compassion fatigue, and proposes design directions for adaptive, responsive systems. The framework offers a shared vocabulary to guide clinicians, researchers, and technology designers in developing more person-centered and clinically grounded innovation in dementia care.

The More the Merrier: Combining Properties for ABox Abduction under Repair Semantics for ELbot

2606.19197v1 by Anselm Haak, Patrick Koopmann, Yasir Mahmood, Anni-Yasmin Turhan

Abduction is a central approach to explaining missing entailments from a knowledge base by providing a hypothesis that would, if added to the knowledge base, make the missing entailment hold. Abduction under repair semantics has recently been investigated in detail, where several desirable properties and optimality criteria were considered, such as signature restrictions and minimality in size and in introduced conflicts. Naturally, hypotheses that satisfy more than one of these properties, or that combine a property with an optimality criterion, would be even more desirable for applications. So far, such hypotheses have not been investigated in the literature. In the present paper, we consider the ABox abduction problem for hypotheses satisfying more than one property or additional optimality criteria, for EL_bot under brave and AR semantics. Our main observation is that requiring additional properties of hypotheses often does not lead to an increase in complexity.

A Clinician-Centered Pipeline for Annotation and Evaluation in Ultrasound AI Studies

2606.19174v1 by Fangyijie Wang, Jianjun Yu, Wentao Shi, Haixia Huang, Ran Shi, Guénolé Silvestre, Kathleen M. Curran

Clinician-centered evaluation is critical for validating medical AI systems, especially in ultrasound imaging where quantitative metrics do not always capture clinical usability. Existing medical image platforms primarily focus on dataset labeling. They lack integrated support for blinded model comparison and reproducible evaluation workflows. We present a clinician-centered pipeline for remote annotation and evaluation in ultrasound AI studies. The proposed pipeline uses a centralized server and lightweight browser interfaces to enable clinicians to perform annotation, blinded ranking, and review without local dataset downloads. The pipeline also supports multi-rater participation, centralized result aggregation, and automated statistical analysis. We validate the pipeline in a fetal ultrasound segmentation study with six raters spanning expert, generalist, and non-expert experience levels. The system automatically generated Spearman correlation, Kendall's $\tau$, and top-1 selection statistics. Results indicated moderate to strong agreement across experts and other groups. The blinded evaluation results showed a tendency for later active learning models to be preferred. These outcomes suggest that the pipeline can support clinician-centered annotation and reproducible human-AI evaluation studies in ultrasound imaging. The proposed pipeline is available on GitHub (https://github.com/13204942/SonoRate).
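
To make the agreement statistics named above concrete, here is a minimal sketch of Spearman correlation, Kendall's tau, and top-1 agreement between two raters. It assumes tie-free rankings and uses Kendall's tau-a; the rater names and rankings are hypothetical, and this is not the code of the released pipeline.

```python
from itertools import combinations
from typing import Sequence


def spearman_rho(a: Sequence[int], b: Sequence[int]) -> float:
    """Spearman correlation for two tie-free rankings of the same items."""
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def kendall_tau(a: Sequence[int], b: Sequence[int]) -> float:
    """Kendall's tau-a: (concordant pairs - discordant pairs) / all pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


def top1_agreement(a: Sequence[int], b: Sequence[int]) -> bool:
    """True when both raters put the same item in first place (rank 1)."""
    return list(a).index(1) == list(b).index(1)


# Hypothetical blinded rankings of four models (1 = best) from two raters.
expert = [1, 2, 3, 4]
generalist = [2, 1, 3, 4]

rho = spearman_rho(expert, generalist)
tau = kendall_tau(expert, generalist)
same_winner = top1_agreement(expert, generalist)
print(f"rho = {rho:.3f}, tau = {tau:.3f}, same top-1 = {same_winner}")
```

A single swap at the top of the ranking leaves both correlations fairly high while the top-1 choice differs, which is one reason to report a top-1 statistic alongside the rank correlations.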

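Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction

2606.19144v1 by Jingyi Zhou, Senlin Luo, Haofan Chen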

Current conversational AI systems have made significant progress in language generation, personalization, and long-context interaction. However, most existing methods model social behavior through isolated components such as emotion modeling, memory retrieval, or persona conditioning, lacking a unified framework to explain the emergence of stable social relationships and social intelligence in long-term human-AI interaction. To address this, we propose the Human-AI Coevolution Dynamics Framework (HACD-H), a formal model of human-AI interaction as a self-organizing social cognitive system. HACD-H integrates emotional adaptation, relational organization, social memory, and personality consistency into a unified dynamical framework and introduces principles including multi-timescale social cognition, relational attractors, trust basins, developmental phase transitions, and social cognitive energy dynamics. We construct a conversational dataset with approximately 14,700 interaction turns and develop a theory-driven empirical evaluation framework. Results reveal a hierarchy of temporal persistence in social cognition, stable relational attractors, phase-transition-like developmental patterns, and a structured social cognitive energy landscape. Social intelligence shows a significant negative correlation with social cognitive energy (r = -0.391, p < 0.001), and interaction trajectories exhibit progressive energy reduction over time. These findings suggest that social intelligence emerges from long-term social cognitive coevolution rather than isolated conversational capabilities. HACD-H provides a unified theoretical foundation for modeling adaptive human-AI social interaction and developing socially intelligent AI systems.

Analysing drivers and interdependencies in European electricity markets using XAI

2606.19118v1 by Antoine Pesenti, Aidan O'Sullivan

Electricity markets are inherently complex systems characterised by strong nonlinearities, high-dimensional interactions, and increasing interdependence across regions. While deep neural networks (DNNs) have demonstrated strong predictive capabilities for electricity prices, their lack of interpretability limits their usefulness for understanding the underlying drivers of price formation. This paper addresses this gap by combining DNN models with explainable artificial intelligence (XAI) techniques to analyse the determinants of electricity prices across 39 European bidding zones. We employ SHAP (SHapley Additive exPlanations) to quantify feature contributions and apply and extend SSHAP, an aggregation framework to improve interpretability in high-dimensional settings. The analysis identifies that renewable energy sources, particularly solar, play a disproportionately important role in price formation despite their lower share in total power generation. Gas prices remain a dominant and consistent driver across electricity markets, while interconnections significantly shape price dynamics, highlighting the strong interdependence of European electricity systems. In addition, a synthetic EU-wide electricity market is constructed to explore the counterfactual scenario of a fully integrated market with a single price.
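
For readers unfamiliar with the quantity that SHAP estimates, the following sketch computes exact Shapley values by enumerating feature coalitions for a three-feature toy price model. The `toy_price` function, its coefficients, and the feature names are invented for illustration; the paper's DNN models and the SSHAP aggregation step are not reproduced here, and the SHAP library approximates these values rather than enumerating them.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, List


def shapley_values(
    features: List[str], value: Callable[[FrozenSet[str]], float]
) -> Dict[str, float]:
    """Exact Shapley values: weighted average marginal contribution of each feature."""
    n = len(features)
    phi: Dict[str, float] = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):  # k = size of the coalition that f joins
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_price(present: FrozenSet[str]) -> float:
    """Hypothetical price (EUR/MWh) when only the `present` features are known."""
    price = 50.0
    if "gas" in present:
        price += 30.0
    if "solar" in present:
        price -= 12.0
    if "gas" in present and "solar" in present:
        price -= 6.0  # made-up interaction term between the two features
    if "interconnection" in present:
        price -= 4.0
    return price


features = ["gas", "solar", "interconnection"]
phi = shapley_values(features, toy_price)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.2f}")
```

The interaction term is split evenly between the two features involved, and the contributions sum to the difference between the full-model price and the baseline price, which is the additivity property that gives SHAP its name.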

APT: Atomic Physical Transitions for Causal Video-Language Understanding

2606.18586v1 by Shang Wu, Haoran Lu, Songling Liu, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu

Physical events are not understood by their names alone, but by the causal state changes that compose them. A clip-level label such as "bounce" can be correct while hiding the process that makes the event physically valid, from support loss and contact onset to rebound and settling. To make this hidden process explicit, we introduce Atomic Physical Transitions (APTs): minimal, temporally localized state changes that bind a visible cue to an active physical mechanism and before/after dynamical regimes. An APT chain represents a video as an ordered causal transition sequence rather than a single aggregate event label: event labels tell what happened; APT chains explain why it happened. To make APTs learnable by VLMs, we construct mixed-source APT data from human annotations and simulator ground truth, covering 14 transition types across contact, gravity, friction, and rotation/stability, with 27,303 timed instances over 1,246 trials. Using this data, we find that current VLMs miss transition-level physics, with zero-shot recall at most 14% and errors dominated by missed transitions. Direct fine-tuning on APT chains improves transition detection but causes event-level forgetting, indicating that the model learns a specialized answer format rather than a reusable physical representation. We therefore propose APT-Tune, a parameter-efficient recipe that teaches VLMs to use causal transitions without forgetting how to answer video questions. It combines image-pad-aware supervision, format-conditional co-training, and mechanism-conditioned domain-to-type decoding to make APT learning format-robust and physically grounded. With only 11 M LoRA parameters on Qwen3-VL-2B, APT-Tune substantially improves APT recall while also improving event-level video transfer. These results show that APTs are not a new answer format, but a human-aligned causal supervision signal for physical video understanding.

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

2606.18557v1 by Patrick Cooper, Alvaro Velasquez

A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches 65% at best and drops to 23.5% under rendering-robust evaluation (worst case over four surface renderings). We introduce DeFAb (Defeasible Abduction Benchmark), a dataset and generation pipeline that converts four decades of publicly funded knowledge bases into formally grounded instances for defeasible abduction: constructing hypotheses that explain anomalies by overriding defaults while preserving unrelated expectations. Because every hypothesis must pass polynomial-time checks for valid derivation, conservativity, and minimality, DeFAb makes logical rigor the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to produce 372,648+ instances across 33.75M materialized rules from 18 sources, in three levels with polynomial-time verifiable gold standards. Four frontier models do not reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (~36 pp) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 pp Level 3 gap. We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, judge-free verifier; a pilot finds zero novel concepts). The same verifier doubles as an exact reward for preference optimization (DPO, RLVR/GRPO). Released under MIT at https://huggingface.co/datasets/PatrickAllenCooper/DeFAb.

PSyGenTAB: A Privacy-Preserving Framework for Synthetic Clinical Tabular Data Generation via Constrained Optimization

2606.18518v1 by Arshia Ilaty, Hossein Shirazi, Manasi Chitale, Kedar Hegde, Dhanalakshmi Ramesh, Rashmi S. Manjunath, Amir Rahmani, Hajar Homayouni

The development of medical AI is constrained by limited access to high-quality clinical data due to institutional silos and strict privacy regulations such as HIPAA and GDPR. Synthetic data generation offers a potential solution, but existing methods lack principled mechanisms to explicitly manage the privacy-utility trade-off, often degrading clinically meaningful patterns or risking patient re-identification. We present PSyGenTAB, a privacy-preserving generative framework that formulates synthetic healthcare data generation as a constrained optimization problem solved using the Augmented Lagrangian Method. By embedding configurable privacy constraints directly into model training, PSyGenTAB enforces minimum privacy thresholds while maximizing clinical data utility. Across multiple clinically motivated benchmarks, PSyGenTAB preserves inter-feature clinical relationships and minority-class diagnostic patterns essential for reliable health AI. Downstream evaluation using Train-on-Synthetic, Test-on-Real and Train-on-Real, Test-on-Synthetic protocols shows that models trained on synthetic data achieve performance comparable to those trained on real patient records. Privacy auditing further demonstrates reduced exact record reproduction and strong resilience to membership inference attacks. These results establish PSyGenTAB as a principled framework for balancing privacy protection and clinical utility in synthetic healthcare data, supporting secure cross-institutional AI development.

摘要：醫療AI的發展受到高品質臨床數據獲取有限的限制，這是由於機構孤島和嚴格的隱私法規，例如HIPAA和GDPR。合成數據生成提供了一個潛在的解決方案，但現有的方法缺乏原則性機制來明確管理隱私與效用之間的權衡，往往會損害臨床上有意義的模式，或帶來患者被重新識別的風險。我們提出了PSyGenTAB，一個保護隱私的生成框架，將合成醫療數據生成公式化為一個約束優化問題，並使用增強拉格朗日方法解決。通過將可配置的隱私約束直接嵌入模型訓練中，PSyGenTAB在最大化臨床數據效用的同時，強制執行最低隱私閾值。在多個臨床動機的基準測試中，PSyGenTAB保留了臨床特徵之間的關係和對可靠健康AI至關重要的少數類別診斷模式。使用“在合成數據上訓練，在真實數據上測試”和“在真實數據上訓練，在合成數據上測試”的下游評估顯示，基於合成數據訓練的模型達到了與基於真實患者記錄訓練的模型相當的性能。隱私審計進一步顯示出精確記錄再現的減少和對成員推斷攻擊的強大抵抗力。這些結果確立了PSyGenTAB作為一個原則性框架，在合成醫療數據中平衡隱私保護和臨床效用，支持安全的跨機構AI開發。
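
The PSyGenTAB abstract above describes enforcing a minimum privacy threshold while maximizing utility via the Augmented Lagrangian Method. The following is a minimal sketch of that general technique on a one-parameter toy problem, not the authors' implementation. The quadratic utility loss, the identity privacy score, the threshold `tau`, and all function names are assumptions chosen so the optimum can be checked by hand (x = 0.5, multiplier = 1.0).

```python
def augmented_lagrangian(tau=0.5, rho=10.0, outer_steps=20, inner_steps=200, lr=0.01):
    """Minimize utility_loss(x) = x**2 subject to privacy(x) = x >= tau.

    Toy stand-ins (assumptions, not from the paper):
      x       : a single "noise scale" parameter
      x**2    : utility loss, grows as more noise is added
      x       : privacy score, must reach at least tau
    The constraint is written as c(x) = tau - x <= 0.
    """
    x, lam = 0.0, 0.0
    for _ in range(outer_steps):
        # Inner loop: gradient descent on the augmented Lagrangian in x.
        for _ in range(inner_steps):
            # Effective multiplier for an inequality constraint.
            mult = max(0.0, lam + rho * (tau - x))
            # d/dx [x**2] + mult * d/dx [tau - x] = 2x - mult
            grad = 2.0 * x - mult
            x -= lr * grad
        # Outer loop: multiplier update, clipped at zero.
        lam = max(0.0, lam + rho * (tau - x))
    return x, lam


if __name__ == "__main__":
    x_opt, lam_opt = augmented_lagrangian()
    print(f"x = {x_opt:.6f}, multiplier = {lam_opt:.6f}")
```

In this toy the constraint is active at the solution, so the multiplier converges to the value that balances the utility gradient against the privacy constraint; in a real generative model the inner loop would be the training step and the privacy score a measured quantity.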

From Specification to Execution: AI Assisted Scientific Workflow Management

2606.18425v1 by Komal Thareja, Hamza Safri, Rajiv Mayani, Anirban Mandal, Ewa Deelman

Scientific workflow management systems (WMS) support scalable and reproducible execution of complex pipelines, but workflow design, implementation, and debugging remain largely manual and require significant expertise. Recent approaches using large language models (LLMs) show promise for workflow generation from natural language, but often rely on direct code synthesis, which limits transparency, reproducibility, and integration with workflow systems. We present an AI-assisted approach to scientific workflow management that combines specification-driven workflow generation, automated debugging, and distributed execution. The method introduces a structured specification phase that separates workflow intent, design, and implementation, allowing validation prior to code generation. We also develop an LLM-based debugging agent that diagnoses and resolves failures across multiple system layers. To support distributed execution and user interaction, we integrate Pegasus, a widely used WMS, with a Model Context Protocol (MCP) layer, providing a unified interface for workflow submission, monitoring, and control. We evaluate the approach using a federated learning workflow for medical imaging, chosen for its parallel, iterative, and dependency-intensive structure. The system generated and executed large-scale workflows with thousands of jobs, reduced debugging effort, and allowed non-expert users to construct workflows with expert-level design patterns. These results indicate that end-to-end AI-assisted workflow generation and execution is feasible, and point toward AI-driven platforms for managing the scientific workflow lifecycle.

摘要：科學工作流程管理系統（WMS）支持可擴展和可重現的複雜管道執行，但工作流程的設計、實施和除錯仍然主要是手動進行，並且需要相當的專業知識。最近使用大型語言模型（LLMs）的方法顯示出從自然語言生成工作流程的潛力，但通常依賴於直接的代碼合成，這限制了透明度、可重現性和與工作流程系統的整合。我們提出了一種AI輔助的科學工作流程管理方法，結合了以規範驅動的工作流程生成、自動化除錯和分佈式執行。該方法引入了一個結構化的規範階段，將工作流程的意圖、設計和實施分開，允許在生成代碼之前進行驗證。我們還開發了一個基於LLM的除錯代理，能夠診斷和解決多個系統層次的故障。為了支持分佈式執行和用戶互動，我們將廣泛使用的WMS Pegasus與模型上下文協議（MCP）層集成，提供一個統一的工作流程提交、監控和控制界面。我們使用一個聯邦學習的醫學影像工作流程來評估該方法，因為它具有並行、迭代和依賴密集的結構。該系統生成並執行了具有數千個作業的大規模工作流程，減少了除錯工作，並允許非專家用戶以專家級設計模式構建工作流程。這些結果表明，端到端的AI輔助工作流程生成和執行是可行的，並指向AI驅動的平台以管理科學工作流程的生命週期。

RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills

2606.18203v1 by Weizhi Zhang, Zechen Li, Hamid Palangi, Ben Graef, A. Ali Heydari, Simon A. Lee, Salman Rahman, Ray Luo, Zeinab Esmaeilpour, Erik Schenck, Chloe Zhang, Yamin Li, Menglian Zhou, Philip S. Yu, Daniel McDuff, Lindsey Sunden, Mark Malhotra, Shwetak Patel, Ahmed A. Metwally

LLM-powered personal health agents that draw on user health (sensor) metrics offer a promising pathway to alleviate global disparities in healthcare access. However, large-scale clinical deployment remains constrained by an open-ended evaluation bottleneck: physician annotation is reliable but costly and unscalable, while LLM-as-a-judge evaluators are scalable but subjective, inconsistent, and sometimes clinically misaligned. We introduce RubricsTree, a scalable evaluation framework with an expert-aligned hierarchical taxonomy of over 100 atomic, clinically verifiable Boolean rubrics, derived from insights from 4,000 real user queries through an iterative human-in-the-loop curation protocol with an expert panel led by an experienced physician. A context-aware adaptive router activates only the relevant auto-weighted rubric subset per query, providing the throughput needed for scalable evaluation with expert-aligned quality. Through a systematic meta-evaluation, we show that RubricsTree (i) substantially exceeds a strong large-scale evaluation baseline in expert alignment on challenging open-ended queries; (ii) reliably penalizes contextually degraded responses; and (iii) when used as structured instructions, text feedback, or training rewards for performance optimization, yields up to ~66% relative gains on HealthBench for Gemini, GPT, and Qwen model families. RubricsTree thus provides the scalable, auditable, and evolving evaluation infrastructure required for the continuous optimization of product-level personal healthcare AI.

摘要：LLM 驅動的個人健康代理與用戶健康（傳感器）指標提供了一條有希望的途徑，以減輕全球醫療保健獲取的不平等。然而，大規模臨床部署仍然受到一個無限期評估瓶頸的限制：醫生註釋可靠但成本高昂且無法擴展，而 LLM 作為評估者則可擴展但主觀、不一致，有時與臨床不符。我們介紹了 RubricsTree，一個可擴展的評估框架，具有專家對齊的分層分類法，包含超過 100 個原子級的臨床可驗證布爾標準，這些標準源於 4,000 個真實用戶查詢的洞察，通過一個由經驗豐富的醫生領導的專家小組進行的迭代人機協作策劃協議進化而來。上下文感知的自適應路由器僅在每個查詢中激活相關的自動加權標準子集，提供可擴展評估所需的通量，並保持專家對齊的質量。通過系統的元評估，我們顯示 RubricsTree (i) 在挑戰性的開放式查詢上，專家對齊的表現顯著超過強大的大規模評估基準；(ii) 可靠地懲罰上下文退化的回應；以及 (iii) 當用作結構化指令、文本反饋或性能優化的訓練獎勵時，對 Gemini、GPT 和 Qwen 模型系列在 HealthBench 上產生高達約 66% 的相對增益。因此，RubricsTree 提供了一個可擴展的、可審計的、持續演進的評估基礎設施，滿足產品級個人健康 AI 持續優化的需求。
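
The RubricsTree abstract describes atomic Boolean rubrics, a router that activates only the relevant subset per query, and weights on the active rubrics. A rough sketch of that scoring pattern follows. The rubric names, topics, weights, and keyword checks are all hypothetical; keyword matching merely stands in for whatever judge (LLM or physician) decides each Boolean in the real framework.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Rubric:
    name: str                      # hypothetical rubric identifier
    topic: str                     # hypothetical routing key
    weight: float                  # hypothetical importance weight
    check: Callable[[str], bool]   # Boolean judgment on a response


def route(rubrics: Iterable[Rubric], query_topics: set) -> List[Rubric]:
    """Activate only the rubrics whose topic is relevant to the query."""
    return [r for r in rubrics if r.topic in query_topics]


def score(response: str, active: List[Rubric]) -> float:
    """Weighted fraction of active rubrics that the response satisfies."""
    total = sum(r.weight for r in active)
    if total == 0:
        return 0.0
    passed = sum(r.weight for r in active if r.check(response))
    return passed / total


RUBRICS = [
    Rubric("cites_user_sleep_data", "sleep", 2.0, lambda t: "hours" in t.lower()),
    Rubric("recommends_clinician_followup", "safety", 1.0, lambda t: "clinician" in t.lower()),
    Rubric("mentions_step_count", "activity", 1.0, lambda t: "steps" in t.lower()),
]

if __name__ == "__main__":
    response = "You averaged 5.2 hours of sleep this week; try a fixed bedtime."
    active = route(RUBRICS, {"sleep", "safety"})
    print([r.name for r in active], round(score(response, active), 3))
```

Because every rubric is a named Boolean, a low score can be traced to the specific rubrics that failed, which is the auditability property the abstract emphasizes.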

Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour

2606.18129v1 by Abeer Badawi, Moyosoreoluwa Olatosi, Negin Baghbanzadeh, Laleh Seyyed-Kalantari, Frank Rudzicz, R. Shayna Rosenbaum, Sara Pishdadian, Elham Dolatabadi

Recent incidents involving LLMs used for mental-health support reveal a critical evaluation gap: surface-level safety scores do not capture how models behave across realistic, emotionally sensitive interactions over time. Existing benchmarks measure knowledge, safety, or static response quality, but miss whether LLM interactions help users keep reflecting, coping, and making decisions themselves. We formalize this missing dimension as COGNITIVE ATROPHY, a process-level behavioural measure in AI-mediated mental-health support distinct from safety and helpfulness. To measure it, we introduce COGNITIVE ATROPHY BENCH, a clinically grounded benchmark built from 1,576 fully human-generated counseling conversations, 15,680 turns, and 42,230 responses from five LLMs. Three clinical and neuropsychology experts developed a 20-attribute schema spanning user context, response behaviour, and global risk flags; six trained clinical reviewers applied it with span-grounded evidence, producing 5,324 reviewer judgments. We further introduce the User-Input Risk Index (UIRI), the Cognitive Atrophy Risk Index (ARI), and trajectory summaries. Across five LLMs, models show a consistent moderate-to-high level of atrophy-aligned behaviour across single and multi-turn settings. While models generally respond to overt safety cues, they adapt less reliably when users seek solutions or decisions. The dominant recurring patterns are directive advice, problem-solving, recommendation responses, topic shifts, and forms of validation that may reinforce dependence rather than reflection. Our work makes COGNITIVE ATROPHY measurable and provides a foundation for auditing model behaviour in sensitive LLM conversations.

摘要：最近涉及用於心理健康支持的LLM事件揭示了一個關鍵的評估缺口：表面上的安全分數無法捕捉模型在現實情境中隨時間推移的情感敏感互動中的行為。現有的基準測量知識、安全性或靜態反應質量，但未能評估LLM互動是否幫助用戶持續反思、應對和自主做出決策。我們將這一缺失的維度正式化為認知萎縮（COGNITIVE ATROPHY），這是一種在AI介導的心理健康支持中與安全性和幫助性不同的過程層面行為測量。為了測量它，我們引入了認知萎縮基準（COGNITIVE ATROPHY BENCH），這是一個基於1,576個完全由人類生成的諮詢對話、15,680次回合和來自五個LLM的42,230個回應的臨床基準。三位臨床和神經心理學專家開發了一個涵蓋用戶背景、回應行為和全球風險標誌的20屬性架構；六位經過培訓的臨床審核員應用該架構並提供基於證據的評估，產生了5,324條審核判斷。我們進一步引入了用戶輸入風險指數（User-Input Risk Index, UIRI）、認知萎縮風險指數（Cognitive Atrophy Risk Index, ARI）和軌跡摘要。在五個LLM中，模型在單回合和多回合設置中顯示出一致的中到高水平的萎縮對齊行為。儘管模型通常對明顯的安全提示作出反應，但當用戶尋求解決方案或決策時，它們的適應性較低。主導的重複模式包括指導性建議、問題解決、推薦回應、主題轉換和可能加強依賴而非反思的驗證形式。我們的工作使認知萎縮可測量，並為審計敏感LLM對話中的模型行為提供了基礎。

Agentic AI-based Framework for Mitigating Premature Diagnostic Handoff and Silent Hallucination in Healthcare Applications

2606.18068v1 by Divyansh Srivastava, Shreya Ghosh, Anshul Verma, Rajkumar Buyya

Recent advances in Large Language Models (LLMs) and multi-agent systems have driven the rise of Agentic AI, showing promise for medical reasoning. However, open-ended conversational agents remain prone to two critical failure modes: premature diagnostic handoff and silent clinical hallucinations that may go undetected before reaching the patient. In this work, we propose a multi-agent framework that addresses both issues by replacing "LLM-as-a-judge" routing with deterministic orchestration constraints. The framework incorporates two safety mechanisms. First, a neuro-symbolic state-tracking gate enforces completeness of the OLDCARTS clinical protocol (Onset, Location, Duration, Character, Aggravating/Alleviating factors, Radiation, Timing, and Severity) by blocking diagnostic transitions until all required dimensions are collected. Second, an epistemic uncertainty quantification (UQ) gate computes semantic entropy (H) across K=5 independent diagnostic samples to identify and intercept divergent outputs before delivery. We evaluate the system using simulated patient agents powered by the llama-3.1-70b-instruct model on 150 test cases. The full architecture achieves 49.3% diagnostic precision, representing an absolute improvement of 11.3 percentage points over an unconstrained baseline. Additionally, we observe a statistically significant negative correlation (r = -0.181, p < 0.05) between OLDCARTS completeness (σ) and semantic entropy (H), suggesting that structured information gathering is associated with reduced diagnostic uncertainty.

摘要：最近在大型語言模型（LLMs）和多代理系統方面的進展推動了代理式人工智慧的興起，顯示出在醫學推理方面的潛力。然而，開放式對話代理仍然容易出現兩種關鍵的失敗模式：過早的診斷轉交和可能在到達患者之前未被檢測到的靜默臨床幻覺。在這項工作中，我們提出了一個多代理框架，通過用確定性編排約束取代“LLM作為裁判”的路由來解決這兩個問題。該框架包含兩個安全機制。首先，一個神經符號狀態跟蹤閘強制執行OLDCARTS臨床協議的完整性（起始、位置、持續時間、特徵、加重/緩解因素、輻射、時間和嚴重性），通過阻止診斷轉換直到收集所有所需的維度。其次，一個認知不確定性量化（UQ）閘計算K=5個獨立診斷樣本的語義熵（H），以識別並攔截在交付之前的分歧輸出。我們使用由llama-3.1-70b-instruct模型驅動的模擬患者代理在150個測試案例中評估系統。完整架構實現了49.3%的診斷精確度，與不受約束的基線相比，絕對改善了11.3個百分點。此外，我們觀察到OLDCARTS完整性（σ）與語義熵（H）之間存在統計上顯著的負相關（r = -0.181，p < 0.05），這表明結構化的信息收集與降低診斷不確定性相關。
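
The two gates in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions: the entropy threshold `max_entropy`, the return labels, and the field names are hypothetical, and grouping samples by normalized exact match is a crude stand-in for the semantic-equivalence clustering a real semantic-entropy implementation would use.

```python
import math
from collections import Counter

# The eight OLDCARTS dimensions named in the abstract (key spellings are assumptions).
OLDCARTS = (
    "onset", "location", "duration", "character",
    "aggravating_alleviating", "radiation", "timing", "severity",
)


def completeness(collected: dict) -> float:
    """Fraction of OLDCARTS dimensions with a non-empty answer."""
    filled = sum(1 for key in OLDCARTS if collected.get(key))
    return filled / len(OLDCARTS)


def semantic_entropy(samples, canonicalize=lambda s: s.strip().lower()) -> float:
    """Shannon entropy (nats) over clusters of equivalent diagnostic samples."""
    clusters = Counter(canonicalize(s) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in clusters.values())


def gate(collected: dict, samples, max_entropy: float = 0.7) -> str:
    """Deterministic routing: history gate first, then uncertainty gate."""
    if completeness(collected) < 1.0:
        return "ask_more"      # block the diagnostic transition
    if semantic_entropy(samples) > max_entropy:
        return "escalate"      # intercept divergent outputs before delivery
    return "deliver"


if __name__ == "__main__":
    history = {key: "answered" for key in OLDCARTS}
    k5 = ["Migraine", "migraine", "Migraine ", "tension headache", "migraine"]
    print(round(semantic_entropy(k5), 4), gate(history, k5))
```

With four of the K=5 samples agreeing, the entropy is about 0.50 nats, below the hypothetical threshold, so the sketch delivers; five mutually different samples would give ln 5 ≈ 1.61 and be escalated.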

When LLMs Analyze Scars: From Images to Clinically-Meaningful Features

2606.18063v1 by Ruman Wang, Hangting Ye

Medical image classification faces a fundamental dilemma: while deep learning models achieve remarkable performance at scale, real-world clinical scenarios often suffer from severe data scarcity due to annotation costs, privacy constraints, and disease rarity. This challenge is particularly pronounced in pathological scar classification, where differentiating keloids from hypertrophic scars requires subtle expert knowledge and labeled images are extremely limited. We propose a novel paradigm that repositions large language models (LLMs) as knowledge-driven feature engineers rather than end-to-end classifiers. We call this framework ScaFE (Scar Feature Engineering). Our key insight is that LLMs encode rich medical knowledge that can be externalized as executable feature extraction code, enabling the transformation of high-dimensional images into low-dimensional, clinically interpretable representations. Specifically, we prompt an LLM with established scar assessment criteria to generate deterministic Python code that extracts features aligned with clinical scoring systems such as the Vancouver Scar Scale. Our approach offers three key advantages: (1) data efficiency, achieving robust performance with limited training samples by decoupling knowledge acquisition from statistical learning; (2) privacy preservation, as raw images are processed locally without exposure to external LLMs; and (3) interpretability, through explicit features grounded in clinical reasoning. Extensive experiments on scar classification demonstrate that, under limited data conditions, our method consistently outperforms both end-to-end deep learning baselines and LLMs used as black-box classifiers, establishing a promising direction for integrating LLMs into data-efficient and clinically transparent medical AI systems.

摘要：醫學影像分類面臨一個根本性的困境：雖然深度學習模型在大規模下表現卓越，但現實世界的臨床情境常常因為標註成本、隱私限制和疾病稀有性而遭遇嚴重的數據匱乏。這一挑戰在病理性疤痕分類中尤為明顯，因為區分蟹足腫和肥厚性疤痕需要微妙的專家知識，而標註的影像極為有限。我們提出了一種新穎的範式，將大型語言模型（LLMs）重新定位為知識驅動的特徵工程師，而非端到端的分類器。我們稱這一框架為ScaFE（疤痕特徵工程）。我們的關鍵見解是，LLMs編碼了豐富的醫學知識，這些知識可以外部化為可執行的特徵提取代碼，使高維影像轉換為低維且臨床可解釋的表示。具體而言，我們使用既定的疤痕評估標準來提示LLM生成確定性的Python代碼，提取與臨床評分系統（如溫哥華疤痕量表）對齊的特徵。我們的方法提供了三個主要優勢：（1）數據效率，通過將知識獲取與統計學習解耦，實現有限訓練樣本下的穩健性能；（2）隱私保護，因為原始影像在本地處理，未暴露於外部LLMs；以及（3）可解釋性，通過基於臨床推理的明確特徵。對疤痕分類的廣泛實驗表明，我們的方法在有限數據條件下始終優於端到端的深度學習基準或將LLMs用作黑箱分類器，確立了將LLMs整合進數據高效且臨床透明的醫學AI系統中的有前景方向。

Catastrophic Forgetting is Low-Rank: A Function-Space Theory for Continual Adaptation

2606.18024v1 by Ido Nitzan Hidekel, Dan Raviv

Catastrophic forgetting in continual adaptation is usually studied through parameter drift, replay, or distillation, but these views do not identify which output-space directions are vulnerable. We give a function-space account in the NTK regime: new-task training induces old-task prediction drift through the cross-task kernel, yielding a closed-form predictor for the forgetting vector before any new-task gradient step. In frozen-backbone linear-head PEFT-CL, where the model is linear in the trainable parameters, the predictor is exact up to numerical precision; for nonlinear adapters/full fine-tuning, it is a local NTK approximation. The same expression reveals that forgetting concentrates in a small number of old-task NTK eigenmodes and under frozen linear heads gives a Kronecker scaling rule for the vulnerable rank. These results clarify the relation to prior NTK-overlap theory, explain why parameter-space regularizers can miss output-space interference, and motivate a targeted spectral regularizer.

摘要：在持續適應中的災難性遺忘通常通過參數漂移、重播或蒸餾來研究，但這些觀點並未確定哪些輸出空間方向是脆弱的。我們在 NTK 範疇中給出了一個函數空間的解釋：新任務訓練通過跨任務核引起舊任務預測漂移，從而在任何新任務梯度步驟之前產生遺忘向量的封閉形式預測器。在凍結主幹線性頭的 PEFT-CL 中，模型在可訓練參數上是線性的，預測器的精確度達到數值精度；對於非線性適配器/完全微調，它是一個局部 NTK 近似。相同的表達式顯示，遺忘集中在少數幾個舊任務 NTK 特徵模式中，並且在凍結的線性頭下給出了脆弱秩的克羅內克縮放規則。這些結果澄清了與先前 NTK 重疊理論的關係，解釋了為什麼參數空間正則化器可能會忽略輸出空間的干擾，並激發了一個針對性的譜正則化器。
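
For the frozen-backbone, linear-head case mentioned in the abstract above, the claim that old-task drift is determined by the cross-task kernel can be checked numerically. The sketch below assumes a scalar-output linear head on fixed random features, squared loss, and plain gradient descent; the dimensions, learning rate, and function names are illustrative assumptions, and the paper's actual predictor may be stated differently (for example, for gradient flow or multi-output heads).

```python
import numpy as np


def predicted_forgetting(K_on, K_nn, r0, lr, steps):
    """Old-task prediction drift computed from kernels and the initial residual only.

    K_on : cross-task kernel  Phi_old @ Phi_new.T
    K_nn : new-task kernel    Phi_new @ Phi_new.T
    r0   : initial new-task residual  Phi_new @ w0 - y_new
    Under gradient descent the residual evolves as r_{t+1} = (I - lr*K_nn) r_t,
    so the drift is  -lr * K_on @ sum_{t<steps} r_t.
    """
    M = np.eye(K_nn.shape[0]) - lr * K_nn
    acc = np.zeros_like(r0)
    r = r0.copy()
    for _ in range(steps):
        acc += r
        r = M @ r
    return -lr * K_on @ acc


def simulate(seed=0, d=16, n_old=10, n_new=6, lr=0.01, steps=50):
    rng = np.random.default_rng(seed)
    Phi_old = rng.normal(size=(n_old, d))   # frozen features, old task
    Phi_new = rng.normal(size=(n_new, d))   # frozen features, new task
    y_new = rng.normal(size=n_new)
    w0 = rng.normal(size=d)                 # linear head before adaptation

    K_on = Phi_old @ Phi_new.T
    K_nn = Phi_new @ Phi_new.T
    pred = predicted_forgetting(K_on, K_nn, Phi_new @ w0 - y_new, lr, steps)

    # Actually train the head on the new task and measure the drift.
    w = w0.copy()
    for _ in range(steps):
        w -= lr * Phi_new.T @ (Phi_new @ w - y_new)
    actual = Phi_old @ w - Phi_old @ w0
    return pred, actual, int(np.linalg.matrix_rank(K_on))


if __name__ == "__main__":
    pred, actual, rank = simulate()
    print("max abs error:", float(np.max(np.abs(pred - actual))), "rank(K_on):", rank)
```

The drift vector always lies in the column space of the cross-task kernel, so its effective dimensionality is bounded by that kernel's rank, which gives an elementary view of why forgetting can concentrate in few directions.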

LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling

2606.18023v1 by Jian Yang, Shawn Guo, Wei Zhang, Tianyu Zheng, Yaxin Du, Haau-Sing Li, Jiajun Wu, Yue Song, Yan Xing, Qingsong Cai, Zelong Huang, Chuan Hao, Ran Tao, Xianglong Liu, Wayne Xin Zhao, Mingjie Tang, Weifeng Lv, Ming Zhou, Bryan Dai

Looped Transformers scale latent computation by repeatedly applying shared blocks, but sequential looping increases latency and KV-cache memory with the loop count. Parallel loop Transformers (PLT) alleviate this cost through cross-loop position offsets (CLP) and shared-KV gated sliding-window attention, making loop count a practical design choice. We therefore study PLT loop-count selection through a gain-cost view: an extra loop may refine representations, but CLP also introduces a positional mismatch at each loop boundary. We instantiate this study by training LoopCoder-v2, a family of 7B PLT coders with different loop counts, from scratch on 18T tokens, followed by matched instruction tuning and evaluation. Empirically, the two-loop variant delivers broad gains over the non-looped baseline across code generation, code reasoning, agentic software engineering, and tool-use benchmarks, improving SWE-bench Verified from 43.0 to 64.4 points and Multi-SWE from 14.0 to 31.0 points. In contrast, variants with three or more loops regress, revealing a strongly non-monotonic loop-count effect. Our diagnostics show that loop 2 provides the main productive refinement, while later loops yield diminishing, oscillatory updates and reduced representational diversity. Because the CLP-induced mismatch remains roughly fixed as refinement gains shrink, the offset cost increasingly dominates. This gain-cost trade-off explains PLT's saturation at two loops and provides diagnostics for loop-count selection.

摘要：循環Transformer透過重複應用共享區塊來擴展潛在計算，但序列循環會隨著循環次數增加延遲和KV快取記憶體。平行循環Transformer（PLT）通過交叉循環位置偏移（CLP）和共享-KV門控滑動窗口注意力來減輕這一成本，使循環次數成為一個實用的設計選擇。因此，我們從增益-成本的角度研究PLT循環次數的選擇：額外的循環可能會細化表示，但CLP在每個循環邊界也引入了位置不匹配。我們通過訓練LoopCoder-v2來實現這項研究，這是一個擁有不同循環次數的7B PLT編碼器系列，從頭開始在18T標記上進行訓練，隨後進行匹配的指令調整和評估。實證結果顯示，兩循環變體在代碼生成、代碼推理、主動軟體工程和工具使用基準測試中，對比非循環基準提供了廣泛的增益，將SWE-bench Verified從43.0提高到64.4分，Multi-SWE從14.0提高到31.0分。相比之下，具有三個或更多循環的變體則表現回落，顯示出強烈的非單調循環次數效應。我們的診斷顯示，循環2提供了主要的生產性細化，而後續循環則產生遞減的、震盪的更新和減少的表示多樣性。由於CLP引起的不匹配在細化增益縮小時大致保持不變，因此偏移成本日益主導。這一增益-成本權衡解釋了PLT在兩個循環時的飽和，並為循環次數選擇提供了診斷依據。

A Quantitative Analysis of Multimodal Biomarkers in Alzheimer's Disease

2606.17867v1 by Antonio Scardace, Daniele Ravì

Despite increasing adoption of multimodal approaches in Alzheimer's Disease (AD) research, which aim to integrate molecular, structural, clinical, and genetic biomarkers to enhance disease characterization, the relationships among these modalities remain poorly understood. A systematic analysis of their dynamic interaction is essential for improving disease modeling, identifying redundant assessments, and reducing patient burden and acquisition costs. In this paper, we present a quantitative analysis of multimodal AD biomarkers by integrating tau-PET, structural MRI, cognitive scores (MMSE and CDR), and APOE4 data from 789 subjects drawn from the ADNI dataset. In our analyses, we (A) quantify cross-modal mutual information and explained variance to assess redundancy and predictive dependencies; (B) examine associations between tau topologies and structural atrophy across brain regions to select informative ROIs; (C) perform a statistical decomposition of the tau-cognition association into atrophy-related and atrophy-independent components; and (D) identify a dominant neurodegenerative trajectory that aligns with cognitive decline. This study provides a systematic characterization of cross-modal relationships, improving the interpretability and selection of biomarkers in AD. Code is publicly available at: https://github.com/antonioscardace/Multimodal-AD.

摘要：儘管在阿茲海默症（AD）研究中越來越多地採用多模態方法——旨在整合分子、結構、臨床和遺傳生物標記以增強疾病特徵——這些模態之間的關係仍然不甚了解。對它們動態互動的系統分析對於改善疾病建模、識別冗餘評估以及減少患者負擔和獲取成本至關重要。本文中，我們通過整合來自789名受試者的tau-PET、結構MRI、認知評分（MMSE和CDR）及APOE4數據，對多模態AD生物標記進行定量分析，這些數據來自ADNI數據集。在我們的分析中，我們（A）量化跨模態的互信息和解釋變異，以評估冗餘和預測依賴；（B）檢查tau拓撲與大腦區域結構性萎縮之間的關聯，以選擇有用的ROI；（C）對tau-認知關聯進行統計分解，將其分為與萎縮相關和與萎縮無關的成分；（D）並識別與認知衰退相一致的主導神經退行性軌跡。本研究提供了跨模態關係的系統特徵，改善了AD中生物標記的可解釋性和選擇性。代碼可在以下網址公開獲取：https://github.com/antonioscardace/Multimodal-AD。
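
Step (A) of the abstract above, quantifying cross-modal mutual information and explained variance, can be illustrated on synthetic data. The sketch below is not the authors' code (their repository is linked in the abstract) and does not use ADNI data; the variables `tau`, `atrophy`, and `unrelated` are simulated Gaussian stand-ins, and the histogram-based estimator with 8 bins is one simple choice among many.

```python
import numpy as np


def mutual_information(x, y, bins=8):
    """Plug-in mutual information estimate (nats) from a 2-D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))


def explained_variance(x, y):
    """Share of variance in y explained by a linear fit on x (squared correlation)."""
    return float(np.corrcoef(x, y)[0, 1] ** 2)


def demo(seed=0, n=2000):
    rng = np.random.default_rng(seed)
    tau = rng.normal(size=n)                          # simulated "tau burden"
    atrophy = 0.8 * tau + 0.6 * rng.normal(size=n)    # simulated, correlated 0.8 with tau
    unrelated = rng.normal(size=n)                    # simulated independent marker
    return {
        "mi_dependent": mutual_information(tau, atrophy),
        "mi_independent": mutual_information(tau, unrelated),
        "r2_dependent": explained_variance(tau, atrophy),
        "r2_independent": explained_variance(tau, unrelated),
    }


if __name__ == "__main__":
    for name, value in demo().items():
        print(f"{name}: {value:.3f}")
```

A high mutual information or explained variance between two modalities suggests one is partly redundant given the other, which is the kind of evidence the abstract proposes for trimming assessments.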

Conservation Laws for Modern Neural Architectures

2606.17816v1 by Viet-Hoang Tran, Vinh Khanh Bui, Tan Lai Ngoc, Nam Nguyen, Tuan Dam, Tan M. Nguyen

Understanding gradient descent dynamics is key to explaining the success of over-parameterized models, where implicit bias manifests through conservation laws in gradient flow. While such laws are well understood for linear and ReLU networks, they remain largely unexplored for modern architectures. This work develops a unified framework to characterize conservation laws for contemporary models, including feedforward networks with GELU, SiLU, and SwiGLU activations, multihead attention with sinusoidal and rotary positional encodings, and Mixture-of-Experts architectures under diverse gating designs. Our theoretical findings are supported by experiments that validate the predicted invariants.

摘要：理解梯度下降的動態對於解釋過度參數化模型的成功至關重要，其中隱性偏差通過梯度流中的守恆定律表現出來。雖然這些定律在線性和ReLU網絡中已經得到很好的理解，但在現代架構中仍然大多未被探索。本研究開發了一個統一框架，以表徵當代模型的守恆定律，包括具有GELU、SiLU和SwiGLU激活的前饋網絡、帶有正弦和旋轉位置編碼的多頭注意力，以及在多樣化閘控設計下的專家混合架構。我們的理論發現得到了實驗的支持，這些實驗驗證了預測的不變性。
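
The abstract above notes that conservation laws are already well understood for ReLU networks. As background, the sketch below numerically checks that classical case, not the paper's new results for GELU, SwiGLU, attention, or Mixture-of-Experts: for a bias-free two-layer ReLU network f(x) = a · relu(Wx), gradient flow conserves ||w_j||^2 - a_j^2 for every hidden unit j. Gradient descent with a small step only approximates gradient flow, so the invariant drifts slightly; the data, sizes, and step size are arbitrary assumptions.

```python
import numpy as np


def invariant(W, a):
    """Per-neuron quantity ||w_j||^2 - a_j^2, conserved under gradient flow."""
    return (W ** 2).sum(axis=1) - a ** 2


def train_and_track(steps=300, lr=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    n, d, h = 32, 5, 8
    X = rng.normal(size=(n, d))
    y = np.sin(X.sum(axis=1))               # arbitrary regression target
    W = 0.5 * rng.normal(size=(h, d))       # input weights, one row per hidden unit
    a = 0.5 * rng.normal(size=h)            # output weights

    q_start = invariant(W, a)
    for _ in range(steps):
        pre = X @ W.T                       # pre-activations, shape (n, h)
        act = np.maximum(pre, 0.0)          # ReLU
        err = act @ a - y                   # residuals of mean squared loss 0.5*err^2
        grad_a = act.T @ err / n
        grad_W = ((err[:, None] * (pre > 0)) * a).T @ X / n
        W -= lr * grad_W
        a -= lr * grad_a
    return q_start, invariant(W, a)


if __name__ == "__main__":
    q0, q1 = train_and_track()
    print("max drift of invariant:", float(np.max(np.abs(q1 - q0))))
```

The residual drift is second order in the step size, so halving `lr` (and doubling `steps`) should shrink it further, consistent with exact conservation in the continuous-time limit.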

EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent

2606.17698v1 by Zeyao Du, Tong Li, Haibo Zhang

As LLM-based shopping agents enter production, existing benchmarks fail to capture how a shopper's requirements arrive: stated implicitly in the query, recorded in a profile, or revealed only when the right question is asked. Benchmarks that expose full intent upfront and grade only the final choice can neither pose this long-horizon challenge nor explain which requirement an agent missed. To address this gap, we introduce EComAgentBench, a benchmark of 662 tasks grounded in real Amazon products and reviews. Each task scatters these requirements across a visible query, a tool-gated profile, and scripted clarification; an agent must uncover hidden intent, verify candidates against attributes and review evidence, and commit to a single product within 100 tool calls. Moreover, typed, source-tagged rubrics grade every task, attributing each failure to a requirement and its source. Construction is automated yet reliable, with every answer fixed in code before any text is generated and every sample validated. Our evaluation of seven models reveals that even the strongest attains only 57.1% overall accuracy, and rubric satisfaction degrades from visible to hidden sources. Overall, we believe EComAgentBench will serve as a reproducible foundation for moving shopping agents from single-query search toward dependable assistance over long horizons.

摘要：隨著基於LLM的購物代理進入生產階段，現有的基準無法捕捉到購物者需求的來源：在查詢中隱含地表達、記錄在個人資料中，或僅在提出正確問題時才顯露出來。那些提前揭示完整意圖並僅對最終選擇進行評分的基準，既無法提出這種長期挑戰，也無法解釋代理錯過了哪一項需求。為了解決這一空白，我們推出了EComAgentBench，這是一個基於真實亞馬遜產品和評論的662項任務的基準。每個任務將這些需求分散在可見的查詢、工具限制的個人資料和腳本化的澄清中；代理必須揭示隱藏的意圖，根據屬性和評論證據驗證候選項，並在100次工具調用內承諾選擇一個產品。此外，類型化的、來源標記的評分標準對每個任務進行評分，將每次失敗歸因於一項需求及其來源。建設是自動化但可靠的，在生成任何文本之前，每個答案都已固定在代碼中，並且每個樣本都經過驗證。我們對七個模型的評估顯示，即使是最強的模型也僅達到57.1%的整體準確率，並且評分標準的滿意度從可見來源降級到隱藏來源。總體而言，我們相信EComAgentBench將作為一個可重複的基礎，推動購物代理從單一查詢搜索轉向長期可靠的協助。

From Brewing to Resolution: Tracing the Internal Lifecycle of Code Reasoning in LLMs

2606.17648v1 by Siyue Chen, Yifu Guo, Yuquan Lu, Zishan Xu, Jiaye Lin, Jianbo Lin, Siyu Zhang, Cheng Yang, Junxin Li, Yujia Li, Yu Huo, Ruixuan Wang

Standard accuracy metrics cannot explain why LLMs handle variable tracking but fail on semantically equivalent loops. We study an internal lifecycle of code reasoning in which models first brew the answer, making it linearly recoverable many layers before it becomes self-decodable, and then diverge into one of four resolution outcomes: Resolved, Overprocessed, Misresolved, or Unresolved. Understanding this lifecycle matters because similar task accuracies can mask fundamentally different failure modes that surface-level evaluation cannot detect. We introduce a dual diagnostic framework pairing layer-wise linear probing with Context-Stripped Decoding (CSD) and apply it to six code-reasoning task families across 16 models spanning Qwen, Llama, and DeepSeek architectures. All four outcomes carry substantial mass in every task family: overall Resolved is only 41.5%, with multiple tasks below 30%. Controlled sweeps over structure, depth, and operators expose task-specific failure bottlenecks: Function Call Resolved plunges from 61.1% to 2.5% as call depth increases from one to three. Across architectures and scales, the brewing scaffold remains stable, with normalized brewing duration 24-42% across all 16 models, while resolution success varies with capability. This indicates that the scaffold is a stable empirical regularity across the tested decoder-only Transformer families, whereas resolution success covaries with capability, scale, and training. Code: https://github.com/euyis1019/llm-brewing

摘要：標準準確性指標無法解釋為什麼 LLMs 能夠處理變量跟蹤卻在語義等價的迴圈上失敗。我們研究了一個代碼推理的內部生命周期，其中模型首先酝酿答案，使其在變成自我解碼之前的多層中線性可恢復，然後分歧為四種解決結果之一：已解決、過度處理、錯誤解決或未解決。理解這個生命周期很重要，因為類似的任務準確性可能掩蓋根本不同的失敗模式，這是表面評估無法檢測到的。我們引入了一個雙重診斷框架，將逐層線性探測與上下文剝離解碼（CSD）配對，並將其應用於跨越 Qwen、Llama 和 DeepSeek 架構的 16 個模型的六個代碼推理任務系列。所有四種結果在每個任務系列中都具有實質性質量：整體已解決率僅為 41.5%，多個任務低於 30%。對結構、深度和運算符的控制掃描揭示了特定任務的失敗瓶頸：函數調用已解決率隨著調用深度從一增加到三而從 61.1% 驟降至 2.5%。在不同架構和規模中，酝酿支架保持穩定，所有 16 個模型的標準化酝酿持續時間為 24-42%，而解決成功率則隨能力而變化。這表明，該支架在測試過的僅解碼 Transformer 系列中是一種穩定的實證規律，而解決成功率則與能力、規模和訓練相關聯。代碼：https://github.com/euyis1019/llm-brewing

SketchXplain: Intuitive Visual Explanations of Image Classifiers with Sketches

2606.17646v1 by Wencan Zhang, Mario Michelessa, Xuejun Zhao, Brian Y. Lim

Saliency map visualizations explain image-based AI predictions by pointing to regions, but these are often unintuitive and semantically unclear, leaving an interpretability gap. We argue that AI explanations should be intuitive: coherent with user knowledge, yet simple and selective to accelerate interpretation. Inspired by artistic drawings, we propose SketchXplain to generate sketch-based visual explanations for intuitive image-based explainable AI (XAI). Combining techniques in saliency maps, concept-bottleneck models, and sketch optimization, SketchXplain integrates saliency to select coherent observation artifacts, concepts for knowledge coherence, cues to represent them, and abstraction for simplicity. On facial expression recognition, modeling and user studies showed that SketchXplain supported quicker interpretation with more aligned visualizations than saliency maps or simple drawings. Further evaluation on skin lesion diagnosis found that SketchXplain more coherently visualized disease symptoms, better supporting lay diagnosis. Thus, this work illustrates the value of sketches for intuitive, simple, coherent, and quick image-based XAI visualizations.

摘要：顯著性圖視覺化通過指向區域來解釋基於圖像的人工智慧預測，但這些通常不直觀且語義不清，留下了解釋性差距。我們認為，人工智慧的解釋應該是直觀的——與用戶知識一致，但又簡單且具選擇性，以加速解釋。受到藝術繪畫的啟發，我們提出了SketchXplain，用於生成基於草圖的視覺解釋，以實現直觀的基於圖像的可解釋人工智慧（XAI）。SketchXplain結合了顯著性圖、概念瓶頸模型和草圖優化技術，整合顯著性以選擇一致的觀察工件、知識一致性的概念、表示它們的提示以及簡單性的抽象。在面部表情識別的評估中，建模和用戶研究顯示，SketchXplain支持比顯著性圖或簡單繪圖更快的解釋，並且視覺化更一致。對皮膚病變診斷的進一步評估發現，SketchXplain更一致地視覺化疾病症狀，更好地支持非專業診斷。因此，這項工作說明了草圖在直觀、簡單、一致和快速的基於圖像的XAI視覺化中的價值。

Offline Preference-Based Trajectory Evaluation

2606.17541v1 by Fernando Diaz

Offline evaluation of agentic systems often collapses trajectories to terminal success, discarding information about partial progress and inducing widespread ties. These ties create substantial statistical inefficiency by reducing effective sample size and weakening the ability to distinguish systems. We propose preference-based trajectory evaluation, which compares trajectories directly through temporal preferences over progress and time-to-return profiles. We find that, across diverse agentic and interactive benchmarks, standard success-based metrics produce tied comparisons on roughly 75% of instances, whereas trajectory-aware preferences reduce ties to roughly 35%, improving discriminative power, ranking stability, and data efficiency. Our results suggest that benchmark saturation, often attributed to poor data collection or problem difficulty, may also be explained by the choice of evaluation measure.

摘要：離線評估代理系統通常將軌跡簡化為最終成功，忽略了部分進展的信息，並導致廣泛的平局，通過減少有效樣本大小和削弱區分系統的能力來創造實質性的統計低效率。我們提出了基於偏好的軌跡評估，通過對進展和回報時間輪廓的時間偏好直接比較軌跡。我們發現，在各種代理和互動基準中，標準的基於成功的指標在大約75%的情況下產生平局，而考慮軌跡的偏好將平局減少到大約35%，提高了區分能力、排名穩定性和數據效率。我們的結果表明，基準飽和，通常歸因於數據收集不良或問題難度，也可能由評估指標的選擇來解釋。
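
To make the tie argument in the abstract above concrete, the sketch below compares two systems on four hand-built instances, once by terminal success and once by a trajectory-aware rule. The dominance rule used here (one trajectory reaches every progress level no later than the other) is an assumption for illustration, not the paper's definition, and the four trajectory pairs are invented toy data, so the resulting tie rates say nothing about the reported benchmarks.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Trajectory:
    progress: Tuple[float, ...]   # cumulative progress in [0, 1] after each step

    @property
    def success(self) -> bool:
        return self.progress[-1] >= 1.0


def compare_success(a: Trajectory, b: Trajectory) -> int:
    """+1 if only a succeeds, -1 if only b succeeds, 0 on a tie."""
    return (a.success > b.success) - (a.success < b.success)


def time_to(traj: Trajectory, level: float) -> float:
    """First step (1-indexed) at which progress reaches `level`, else infinity."""
    for step, p in enumerate(traj.progress, start=1):
        if p >= level:
            return step
    return float("inf")


def compare_trajectory(a, b, levels=(0.25, 0.5, 0.75, 1.0)) -> int:
    """Hypothetical rule: prefer the trajectory that reaches every level no later."""
    ta = [time_to(a, lv) for lv in levels]
    tb = [time_to(b, lv) for lv in levels]
    a_dominates = all(x <= y for x, y in zip(ta, tb))
    b_dominates = all(y <= x for x, y in zip(ta, tb))
    if a_dominates and not b_dominates:
        return 1
    if b_dominates and not a_dominates:
        return -1
    return 0   # identical profiles or incomparable ones both count as ties


def tie_rate(pairs, compare) -> float:
    return sum(1 for a, b in pairs if compare(a, b) == 0) / len(pairs)


PAIRS = [
    (Trajectory((0.25, 0.5, 0.5)), Trajectory((0.0, 0.25, 0.25))),          # both fail
    (Trajectory((0.5, 0.75, 1.0)), Trajectory((0.25, 0.5, 0.5, 0.75, 1.0))),  # both succeed
    (Trajectory((0.5, 1.0)), Trajectory((0.25, 0.25))),                     # only first succeeds
    (Trajectory((0.5, 1.0)), Trajectory((0.5, 1.0))),                       # identical
]

if __name__ == "__main__":
    print("success-only tie rate:", tie_rate(PAIRS, compare_success))
    print("trajectory-aware tie rate:", tie_rate(PAIRS, compare_trajectory))
```

Every tie that the trajectory-aware rule breaks adds an informative comparison, which is the sense in which the abstract links ties to effective sample size.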

Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing

2606.17478v1 by Kexin Chen, Yi Liu, Haonan Zhang, Yanhui Li, Xinyu Deng, Dongxia Wang

As LLMs acquire stronger reasoning capabilities, deceptive behavior becomes an increasingly serious safety concern. Existing deception monitors either score visible transcripts or derive scalar probe scores from representation vectors, leaving little inspectable evidence about why a response is suspicious. We introduce STATEWITNESS, an activation explainer for deception auditing. A separate decoder reads a target model's hidden states, then answers natural-language queries or emits structured reports about them. We evaluate STATEWITNESS on two target reasoning LLMs across seven deception datasets. STATEWITNESS reaches 0.916 mean AUROC, a relative gain of 11.6% over the best black-box text monitor and 25.0% over the best activation-probe baseline under the same evaluation protocol. When combined with existing monitors, STATEWITNESS reduces missed deceptive examples in simple threshold ensembles. Beyond scalar detection, the decoder returns query-level answers, schema reports, and token- or sentence-level evidence traces for human inspection. We view this interface as a potential building block for broader interpretability and alignment tools.

摘要：隨著大型語言模型（LLMs）獲得更強的推理能力，欺騙行為成為越來越嚴重的安全問題。現有的欺騙監測工具要麼評分可見的記錄，要麼從表示向量中推導出標量探測分數，對於為什麼某個回應可疑幾乎沒有可檢查的證據。我們介紹了STATEWITNESS，一種用於欺騙審計的激活解釋器。一個單獨的解碼器讀取目標模型的隱藏狀態，然後回答自然語言查詢或發出有關它們的結構化報告。我們在七個欺騙數據集上對兩個目標推理LLM評估了STATEWITNESS。STATEWITNESS達到0.916的平均AUROC，相對於最佳黑箱文本監測器提高了11.6%，相對於最佳激活探測基準提高了25.0%，均在相同的評估協議下進行。當與現有監測器結合時，STATEWITNESS減少了簡單閾值集成中的漏檢欺騙例子。除了標量檢測外，解碼器還返回查詢級別的答案、模式報告以及供人類檢查的標記或句子級別的證據追踪。我們將這個介面視為更廣泛的可解釋性和對齊工具的潛在構建基石。

Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation

2606.17405v1 by Xinyu Qin, Anil K. Sood, Ruiheng Yu, Sara Corvigno, Elaine Stur, Lu Wang

Clinical decision support AI systems (CDSASs) must adapt to evolving patient conditions in real-time while adhering to strict safety constraints. We present an online adaptive framework that integrates Treatment Effect (TE) estimation to quantify clinical benefits, a patient Digital Twin (DT) to simulate treatment trajectories, and Reinforcement Learning (RL) for sequential decision-making. The AI system is initially trained on historical medical records and operates in a continuous learning loop. To ensure safety, a rule-based module monitors vital signs and blocks contraindicated treatments. Cases with strong internal model disagreement are flagged for clinician review, simulated in our experiments via a pre-trained outcome model. We validate our framework using both a synthetic clinical simulator and a real-world ovarian cancer dataset from The Cancer Genome Atlas (TCGA). In both simulated and clinical settings, our method demonstrated superior effectiveness and stability in recommending treatments compared to standard computational baselines. Furthermore, the AI system maintains low latency and requires expert consultation for only a minority of cases in our experimental validation, demonstrating its potential as a safe, clinician-supervised tool for personalized medicine that continuously improves through practical use.

摘要：臨床決策支持人工智慧系統 (CDSASs) 必須在遵循嚴格安全限制的同時，實時適應不斷變化的患者狀況。我們提出了一個在線自適應框架，該框架整合了治療效果 (TE) 估算以量化臨床效益、患者數位雙胞胎 (DT) 以模擬治療軌跡，以及強化學習 (RL) 用於序列決策。該人工智慧系統最初在歷史醫療記錄上進行訓練，並在持續學習循環中運作。為了確保安全，一個基於規則的模組監測生命體徵並阻止禁忌治療。內部模型存在強烈不一致的案例會被標記以供臨床醫生審查，這在我們的實驗中是通過預訓練的結果模型來模擬的。我們使用合成臨床模擬器和來自癌症基因組圖譜 (TCGA) 的真實卵巢癌數據集來驗證我們的框架。在模擬和臨床環境中，我們的方法在推薦治療方面展示了優越的有效性和穩定性，相較於標準計算基準。此外，該人工智慧系統保持低延遲，並且在我們的實驗驗證中只有少數案例需要專家諮詢，顯示其作為一個安全的、由臨床醫生監督的個性化醫療工具的潛力，並能通過實際使用不斷改進。

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

2606.17339v1 by Sejal Bhalla, Larry Kieu, Aina Merchant, Eyal de Lara, Alex Mariakakis

Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated condition-specific studies, making results difficult to compare and generalization difficult to assess. We introduce SpeechDx, a large-scale benchmark for clinical speech AI spanning 12 datasets and 27 tasks across diverse health conditions. To enable evaluation across shared clinical mechanisms, SpeechDx structures tasks by the stage of speech production they disrupt: conceptualization, formulation, and articulation. The benchmark tests generalization by including tasks with limited labeled data and evaluating the same health condition across multiple datasets, distinguishing clinically meaningful patterns from dataset artefacts. We systematically evaluate 12 state-of-the-art audio encoders across all tasks and under zero-shot cross-condition transfer. Results show that large-scale speech models represent the strongest overall baselines, domain-specific models improve performance only on closely matched tasks, and no current representation generalizes reliably across the clinical speech landscape. SpeechDx establishes a shared evaluation framework for tracking progress toward general-purpose clinical speech representations.

摘要：語音提供了一個獨特的資訊窗口，通過同時參與神經系統、運動系統、呼吸系統和聲音系統來了解健康。當前的臨床語音AI方法主要通過孤立的特定條件研究進展，使得結果難以比較，且難以評估其普遍性。我們介紹了SpeechDx，這是一個大規模的臨床語音AI基準，涵蓋12個數據集和27個任務，涉及多種健康狀況。為了能夠在共享的臨床機制中進行評估，SpeechDx根據它們干擾的語音產生階段來結構任務：概念化、形成和表達。該基準通過包括有限標記數據的任務來測試普遍性，並在多個數據集中評估相同的健康狀況，以區分臨床上有意義的模式與數據集的假象。我們系統地評估了12種最先進的音頻編碼器在所有任務中的表現，以及在零樣本跨條件轉移下的表現。結果顯示，大規模語音模型代表了最強的整體基準，特定領域模型僅在密切匹配的任務上提高性能，而目前沒有任何表示能夠在臨床語音領域中可靠地普遍化。SpeechDx建立了一個共享的評估框架，以追踪朝向通用臨床語音表示的進展。

Phantoms and Disclosures: a Causal Framework for Auditing Synthetic Data

2606.16952v1 by Kareem Amin, Rudrajit Das, Alessandro Epasto, Adel Javanmard, Dennis Kraft, Mónica Ribero, Sergei Vassilvitskii

The rapid adoption of generative AI and Large Language Models (LLMs) has spurred interest in synthetic data as a privacy-preserving alternative to sensitive real-world datasets. However, generating high-utility synthetic data often carries the risk of memorizing and regurgitating private information from the training corpus. In this work, we present a customizable empirical auditing framework designed to detect and explain such data disclosures. Our framework introduces a mechanism to distinguish between "true disclosures", where the system directly reproduces a user's information, and "phantom disclosures", where the system incidentally generates a user's data. By partitioning input data into training and holdout sets and applying rigorous statistical hypothesis testing, we determine if observed disclosures are consistent with strict privacy baselines, such as zero-learning or specific Differential Privacy (DP) bounds. Crucially, this approach requires no model access, no canary insertion, and no reference model training; it needs only the synthetic output and a held-out control set. We demonstrate that this framework effectively functions as a membership inference attack, providing empirical lower bounds on privacy leakage that are tighter than prior data-based auditing methods. Our approach is model-agnostic, applies to any synthetic data generation mechanism, and requires orders of magnitude fewer computational resources than shadow-model or canary-based alternatives.

摘要：生成式人工智慧和大型語言模型（LLMs）的快速採用激發了對合成數據的興趣，作為一種保護隱私的替代方案，以取代敏感的現實世界數據集。然而，生成高效用的合成數據通常伴隨著記憶和重複訓練語料庫中私人信息的風險。在這項工作中，我們提出了一個可自定義的實證審計框架，旨在檢測和解釋這類數據洩露。我們的框架引入了一種機制，以區分「真實洩露」——系統直接重現用戶信息——和「幻影洩露」——系統偶然生成用戶數據。通過將輸入數據劃分為訓練集和保留集並應用嚴格的統計假設檢驗，我們確定觀察到的洩露是否與嚴格的隱私基準一致，例如零學習或特定的差分隱私（DP）界限。關鍵是，這種方法不需要模型訪問、不需要金絲雀插入，也不需要參考模型訓練——只需要合成輸出和一個保留的控制集。我們證明這個框架有效地作為一種成員推斷攻擊，提供了比先前基於數據的審計方法更緊湊的隱私洩露實證下限。我們的方法是模型無關的，適用於任何合成數據生成機制，並且所需的計算資源比影子模型或金絲雀基礎的替代方案少幾個數量級。
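
The train-versus-holdout comparison can be illustrated with a small sketch. It assumes a one-sided Fisher exact test against a "zero-learning" null, under which a training record is no more likely to appear in the synthetic output than a holdout record. That test is a stand-in chosen for illustration, not necessarily the statistic the paper uses, and the counts are made up.

```python
from math import comb


def disclosure_p_value(train_hits, n_train, holdout_hits, n_holdout):
    """One-sided Fisher exact test (hypergeometric upper tail).

    Returns P(at least `train_hits` of the disclosed records come from the
    training split) under the null that training and holdout records are
    equally likely to be disclosed. Holdout hits play the role of
    "phantom disclosures": matches that cannot be due to memorization.
    """
    total_hits = train_hits + holdout_hits
    n = n_train + n_holdout
    upper = min(total_hits, n_train)
    tail = sum(
        comb(n_train, k) * comb(n_holdout, total_hits - k)
        for k in range(train_hits, upper + 1)
    )
    return tail / comb(n, total_hits)


# Hypothetical audit: 1000 records per split.
p_leaky = disclosure_p_value(train_hits=30, n_train=1000, holdout_hits=10, n_holdout=1000)
p_clean = disclosure_p_value(train_hits=10, n_train=1000, holdout_hits=10, n_holdout=1000)
print(p_leaky, p_clean)
```

A small p-value indicates that training records are disclosed more often than chance matches on the holdout set would explain, which is evidence against the zero-learning baseline.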

Demystifying Variance in Circuit Discovery of LLMs

2606.16920v1 by Frank Zhengqing Wu, Francesco Tonin, Volkan Cevher

Circuit discovery is a key technique in mechanistic interpretability to pinpoint the model components that are crucial for performing a given task. Although the current state-of-the-art method (EAP-IG) performs well on the metric of (un)faithfulness, it suffers from substantial variability. This includes resampling variance, where the circuit changes when we probe with a new batch of data from the same distribution; rephrasing variance, where the discovered circuit shifts when the prompts are rephrased; and sample-wise variance, where a circuit with low population unfaithfulness exhibits large fluctuations in unfaithfulness across individual samples. This paper studies the roots of these variances. We demonstrate that CEAP, our new circuit discovery method that improves upon EAP-IG with a theoretical guarantee, can substantially lessen resampling variance. We further show that rephrasing variance arises because prompts with different templates tend to activate different circuits in the model. This leads us to argue that it may be challenging to find a comprehensive circuit that explains and controls the model's behavior on a task, which can be expressed in countless templates, suggesting that LLMs may be inherently hard to steer. We show that sparsity, which has been claimed to form more compact and interpretable task circuits, fails to solve this problem. Regarding sample-wise variance, we argue that it is largely benign: extremely poor unfaithfulness scores often stem from how unfaithfulness is defined, rather than from defects in the measured circuits. We show that the magnitude of unfaithfulness is affected by selective contribution scaling, a neural mechanism that accounts for the extremely poor scores sometimes observed.

摘要：電路發現是機械解釋性中的一項關鍵技術，用以確定對於執行特定任務至關重要的模型組件。儘管當前最先進的方法（EAP-IG）在（不）忠實度的指標上表現良好，但它卻存在相當大的變異性。這包括重抽樣變異性，即當我們用來自同一分佈的新數據批次進行探測時，電路會發生變化；重新措辭變異性，即當提示被重新措辭時，發現的電路會發生變化；以及樣本級變異性，即具有低人口不忠實度的電路在個別樣本中顯示出不忠實度的大幅波動。
本文研究這些變異的根源。我們證明了CEAP，我們的新電路發現方法，改進了EAP-IG並具有理論保證，能夠顯著減少重抽樣變異性。我們進一步顯示，重新措辭變異性產生的原因是，具有不同模板的提示往往會激活模型中的不同電路。這使我們認為，找到一個綜合的電路來解釋和控制模型在任務上的行為可能是具有挑戰性的，因為這可以用無數模板來表達，這暗示著大型語言模型可能天生難以引導。我們顯示，稀疏性，雖然被聲稱能形成更緊湊和可解釋的任務電路，但未能解決這一問題。關於樣本級變異性，我們認為這在很大程度上是良性的：極差的不忠實度分數往往源於不忠實度的定義，而不是測量電路的缺陷。我們顯示，不忠實度的大小受到選擇性貢獻縮放的影響，這是一種神經機制，解釋了有時觀察到的極差分數。

Symbolic Informalization: Fluent, Productive, Multilingual

2606.16893v1 by Aarne Ranta

Symbolic informalization enables a reliable conversion of formal mathematics to natural language. It has the potential to make machine-checked content human-readable without loss of precision. In a traditional proof system usage, symbolic informalization generalizes the limited mechanisms of syntactic sugar into the ordinary language of mathematics. In a setting where proofs are constructed by artificial intelligence and autoformalization, symbolic informalization can explain what precisely has been constructed. This paper outlines the project Informath, which aims to show how symbolic informalization can produce fluent text with a reasonable development effort and address multiple formal and natural languages. Informath is based on an interlingual architecture, where Dedukti works as a hub between different proof systems (Agda, Lean, Rocq) and Grammatical Framework (GF) takes care of linguistic correctness and variation in different natural languages.

摘要：符號非正式化使得正式數學能夠可靠地轉換為自然語言。它有潛力使機器檢查的內容在人類可讀的情況下不失精確性。在傳統的證明系統使用中，符號非正式化將有限的語法糖機制概括為數學的普通語言。在由人工智慧和自動形式化構建證明的環境中，符號非正式化可以解釋究竟構建了什麼。本文概述了項目Informath，旨在展示符號非正式化如何在合理的開發努力下產生流暢的文本，並處理多種正式和自然語言。Informath基於一種跨語言架構，其中Dedukti作為不同證明系統（Agda、Lean、Rocq）之間的樞紐，而語法框架（GF）則負責不同自然語言中的語言正確性和變化。

Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering

2606.16890v1 by Sanjay Basu

Aggregate accuracy benchmarks conceal a systematic structure in how large language models fail at electronic health record (EHR) question answering: questions requiring more inferential steps produce disproportionately more errors. Motivated by theoretical results on transformer compositionality limits, we introduce a pre-specified hop-count taxonomy -- the number of distinct reasoning steps required to answer a clinical question from an EHR -- as a principled predictor of model failure. We annotate 313 clinician-generated MedAlign EHR question-answer pairs across four hop levels and evaluate 301 questions in a within-model ablation (claude-sonnet-4-6, zero-shot vs. extended thinking) and cross-architecture replications (gpt-4o and gpt-5.4-2026-03-05, zero-shot). All three models, spanning two providers and two OpenAI generations (GPT-4 and GPT-5), show monotone accuracy decline with hop count: Claude Sonnet zero-shot falls from 30.6% (hop=1) to 17.6% (hop=4) (Cochran-Armitage z=-2.30, p=0.011; OR per hop 0.72, 95% CI [0.56,0.92], p=0.008); GPT-4o replicates this (37.8% to 14.7%; OR 0.58 [0.45,0.75], p<0.001); and gpt-5.4-2026-03-05 confirms it (37.8% to 23.5%; OR 0.80 [0.66,0.98], p=0.027). A pre-specified context-sufficiency audit shows higher-hop questions are not differentially disadvantaged by EHR truncation (answerability 93-95% at hops 2-4 vs. 79% at hop=1), so the decline reflects compositional reasoning difficulty. Extended thinking did not significantly flatten the accuracy-depth curve across three reasoning conditions, and thinking-token usage scaled with hop count (r=0.31, p<0.0001), consistent with the predicted O(k) computational requirement. Hop count is thus a theory-motivated, cross-architecture predictor of large-language-model error on EHR question answering, with direct implications for deployment risk stratification of clinical AI.

摘要：聚合準確性基準隱藏了大型語言模型在電子健康紀錄（EHR）問題回答上失敗的系統性結構：需要更多推理步驟的問題產生不成比例的錯誤。基於對Transformer組合性限制的理論結果，我們引入了一個預先指定的跳數分類法——回答來自EHR的臨床問題所需的不同推理步驟數量——作為模型失敗的原則性預測指標。我們對313個臨床醫生生成的MedAlign EHR問題-回答對進行了標註，涵蓋了四個跳數級別，並在模型內部消融（claude-sonnet-4-6，零樣本與擴展思考）和跨架構複製（gpt-4o和gpt-5.4-2026-03-05，零樣本）中評估了301個問題。所有三個模型，涵蓋了兩個提供者和兩個OpenAI世代（GPT-4和GPT-5），都顯示出隨著跳數的增加準確性單調下降：Claude Sonnet零樣本從30.6%（跳=1）下降到17.6%（跳=4）（Cochran-Armitage z=-2.30，p=0.011；每跳的OR 0.72，95% CI [0.56,0.92]，p=0.008）；GPT-4o複製了這一點（37.8%下降至14.7%；OR 0.58 [0.45,0.75]，p<0.001）；而gpt-5.4-2026-03-05確認了這一點（37.8%下降至23.5%；OR 0.80 [0.66,0.98]，p=0.027）。一項預先指定的上下文充分性審計顯示，高跳數問題並未因EHR截斷而受到差異性劣勢（在跳數2-4時可回答率為93-95%，而在跳=1時為79%），因此下降反映了組合推理的困難。擴展思考並未顯著平坦化三種推理條件下的準確性-深度曲線，且思考標記的使用隨著跳數的增加而增長（r=0.31，p<0.0001），與預測的O(k)計算需求一致。因此，跳數成為一個理論驅動的、跨架構的預測指標，用於大型語言模型在EHR問題回答上的錯誤，對臨床AI的部署風險分層具有直接的影響。
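
For readers unfamiliar with the Cochran-Armitage trend test cited in this abstract, here is a minimal sketch of the standard statistic for a binary outcome across ordered groups. The per-hop counts are hypothetical round numbers, not the paper's data.

```python
from math import erf, sqrt


def cochran_armitage(successes, totals, scores):
    """Cochran-Armitage test for a linear trend in proportions.

    successes[i] / totals[i] is the accuracy in group i, and scores[i] is the
    ordinal score of that group (here, the hop count). Returns the z statistic
    and the one-sided p-value for a decreasing trend.
    """
    n_all = sum(totals)
    p = sum(successes) / n_all
    t_stat = sum(t * (r - n * p) for t, r, n in zip(scores, successes, totals))
    var = p * (1 - p) * (
        sum(n * t * t for n, t in zip(totals, scores))
        - sum(n * t for n, t in zip(totals, scores)) ** 2 / n_all
    )
    z = t_stat / sqrt(var)
    p_one_sided = 0.5 * (1 + erf(z / sqrt(2)))  # P(Z <= z)
    return z, p_one_sided


# Hypothetical counts: 100 questions per hop level, accuracy falling with depth.
correct = [40, 32, 26, 20]
totals = [100, 100, 100, 100]
hops = [1, 2, 3, 4]
z, p_value = cochran_armitage(correct, totals, hops)
print(round(z, 2), p_value)
```

A negative z means accuracy declines as the hop count rises. The per-hop odds ratios quoted in the abstract would come from a separate logistic regression of correctness on hop count.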

Medical world models: representing medical states, modelling clinical dynamics and guiding intervention policies

2606.16721v1 by Ke Liu, Mengxuan Li, Yanyi Bao, Tianyun Zhang, Chong Chu, Jiajun Bu, Haishuai Wang

Medical diagnosis and treatment are dynamic processes in which patient states evolve over time and clinical interventions alter future outcomes. Although current medical AI can detect disease, estimate risk and generate reports, many systems still return static labels or scores, offering limited insight into how illness may progress or how alternative interventions may reshape its trajectory. Medical world models adapt the world-model idea from artificial intelligence to healthcare by learning internal simulators of patient-state dynamics. Their long-term goal is to help clinicians anticipate deterioration, compare treatment-conditioned futures and tailor care to individual patients. Yet relevant work remains scattered across foundation models, longitudinal modelling, disease simulation, treatment-effect estimation, reinforcement learning and digital twins. To bridge this gap, this review outlines a roadmap for advancing medical AI from isolated diagnosis and prediction toward medical world models that simulate disease evolution and support intervention decisions. This roadmap is organized around three coupled capabilities: patient-state construction, clinical dynamics modelling and intervention decision support. Across representative systems, the comparison highlights what each capability contributes and how partial components can be integrated into more mature perception--dynamics--planning systems. Finally, we identify the challenges involved in turning plausible rollouts into clinically useful simulators. Related literature is available at https://github.com/1999kevin/awesome_medical_world_models.

摘要：醫療診斷和治療是動態過程，患者狀態隨時間演變，臨床干預改變未來結果。儘管當前的醫療人工智慧可以檢測疾病、估算風險並生成報告，但許多系統仍然返回靜態標籤或分數，對於疾病如何進展或替代干預如何改變其軌跡提供有限的洞察。醫療世界模型將人工智慧中的世界模型概念應用於醫療保健，通過學習患者狀態動態的內部模擬器。它們的長期目標是幫助臨床醫生預測惡化、比較治療條件下的未來並為個別患者量身定制護理。然而，相關工作仍然分散在基礎模型、縱向建模、疾病模擬、治療效果估算、強化學習和數位雙胞胎之間。為了彌補這一差距，本綜述概述了一個推進醫療人工智慧的路線圖，從孤立的診斷和預測轉向模擬疾病演變並支持干預決策的醫療世界模型。這個路線圖圍繞三個相互關聯的能力組織：患者狀態構建、臨床動態建模和干預決策支持。在代表性系統中，這一比較突顯了每個能力的貢獻，以及如何將部分組件整合到更成熟的感知--動態--規劃系統中。最後，我們確定了將可行的推廣轉變為臨床有用模擬器所面臨的挑戰。相關文獻可在 https://github.com/1999kevin/awesome_medical_world_models 獲得。

Is Your Trajectory Displacement Safe in Long-tail?

2606.16313v1 by Qiao Sun, Weicheng Zheng, Yixin Huang, Hang Zhao

Long-tail scenarios remain a major bottleneck for autonomous driving evaluation, even as datasets grow by orders of magnitude. Existing evaluation pipelines are rarely human-aligned, safety-aware, verifiable, and explainable at the same time: closed-loop metrics often saturate among strong planners, while unstructured human ratings can be noisy without a carefully designed protocol. We formulate planning evaluation as additional-threat detection: given a planner trajectory and an expert reference, does the planner's displacement introduce new unsafe driving behavior? We propose FluidTest, an evaluation pipeline with three components: a pairwise WebUI protocol for reliable human annotation; a taxonomy of 32 semantic threats with evidence-grounded decision graphs; and a three-agent verification system with reflection for precision and auditability. Experiments on the WOD-E2E dataset show that FluidTest produces consistent labels among trained annotators and identifies additional threats in 65% of Poutine trajectories and 51% of RAP trajectories. These results show that state-of-the-art planners can still exhibit substantial safety-relevant failures despite high Rater Feedback Scores (RFS) and low Average Displacement Error (ADE). Additional details, guidance, and code are available at https://fluidtest.web.app.

摘要：長尾場景仍然是自動駕駛評估的一個主要瓶頸，即使數據集的規模增長了幾個量級。現有的評估管道很少同時符合人類對齊、安全意識、可驗證和可解釋的要求：閉環指標在強大的規劃者之間往往會飽和，而無結構的人類評分在沒有精心設計的協議下可能會產生噪音。我們將規劃評估定義為額外威脅檢測：給定一個規劃者的軌跡和一個專家參考，該規劃者的位移是否引入了新的不安全駕駛行為？我們提出了FluidTest，一個具有三個組件的評估管道：一個用於可靠人類註釋的成對WebUI協議；一個包含32種語義威脅的分類法，並附有基於證據的決策圖；以及一個具有反思的三代理驗證系統，以提高精確性和可審計性。在WOD-E2E數據集上的實驗顯示，FluidTest在訓練過的註釋者之間產生了一致的標籤，並在65%的Poutine軌跡和51%的RAP軌跡中識別了額外的威脅。這些結果表明，儘管Rater Feedback Scores (RFS)高且Average Displacement Error (ADE)低，最先進的規劃者仍然可能出現實質性的安全相關失敗。更多詳細信息、指導和代碼可在 https://fluidtest.web.app 獲得。
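
Average Displacement Error (ADE), one of the metrics this abstract argues is insufficient on its own, is the mean Euclidean distance between matched planner and reference waypoints. A minimal sketch with made-up 2D waypoints in metres:

```python
from math import hypot


def average_displacement_error(planned, reference):
    """Mean Euclidean distance between matched (x, y) waypoints."""
    if not planned or len(planned) != len(reference):
        raise ValueError("trajectories must be non-empty and of equal length")
    total = sum(
        hypot(px - rx, py - ry)
        for (px, py), (rx, ry) in zip(planned, reference)
    )
    return total / len(planned)


# Hypothetical trajectories: the planner drifts sideways by up to 0.4 m.
reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
planned = [(0.0, 0.0), (1.0, 0.3), (2.0, 0.4)]
ade = average_displacement_error(planned, reference)
print(ade)
```

Because ADE is purely geometric, two trajectories with the same score can differ in whether the displacement moves the vehicle toward a hazard, which is the kind of gap the additional-threat formulation targets.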

PACT: Privileged Trace Co-Training for Multi-Turn Tool-Use Agents

2606.16215v1 by Zhenbang Du, Jun Luo, Zhiwei Zheng, Xiangchi Yuan, Kejing Xia, Dachuan Shi, Qirui Jin, Qijia He, Shaofeng Zou, Yingbin Liang, Wenke Lee

Multi-turn tool-use agents must reason, call tools, and adapt to observations across several interaction turns. Post-training such agents is challenging, as reinforcement learning often suffers from sparse rewards and weak credit assignment despite matching the prompt-only inference setting, while supervised fine-tuning on expert traces provides dense process supervision but can over-constrain the model to fixed trajectories. To tackle this, we propose PACT, a Privileged trAce Co-Training framework for multi-turn tool-use agents. The key idea is to use expert traces only as training-time optimization signals rather than rollout-time hints. PACT keeps rollout generation prompt-only, then uses expert traces to guide optimization through two complementary signals: a trace-conditioned RL surrogate that evaluates prompt-only rollouts under expert-trace context, and a component-aware SFT loss that supervises reasoning prefixes and tool-calls with annealed strength. To reduce over-reliance on the training-only trace context, PACT further introduces a prompt-only anchoring. We also provide a latent-trace view that connects the two trace-based objectives and explains how expert traces can guide optimization without being used during rollout generation. Experiments on FTRL, BFCL, and ToolHop show that PACT consistently improves over strong SFT- and RL-based baselines, highlighting the value of privileged trace co-training for multi-turn tool-use learning.

摘要：多回合工具使用代理必須進行推理、調用工具並適應多次互動回合中的觀察。訓練後這類代理是具有挑戰性的，因為強化學習通常面臨稀疏獎勵和弱信用分配的問題，儘管它與僅提示的推理設置相匹配，而在專家痕跡上進行監督微調則提供了密集的過程監督，但可能會使模型過度約束於固定的軌跡。為了解決這個問題，我們提出了PACT，一種用於多回合工具使用代理的特權痕跡共同訓練框架。其關鍵思想是僅在訓練期間將專家痕跡用作優化信號，而不是在回合期間提供提示。PACT保持回合生成僅基於提示，然後利用專家痕跡通過兩個互補信號來指導優化：一個基於痕跡的強化學習替代品，該替代品在專家痕跡上下文中評估僅基於提示的回合，以及一個組件感知的SFT損失，該損失以逐漸減弱的強度監督推理前綴和工具調用。為了減少對僅訓練痕跡上下文的過度依賴，PACT進一步引入了僅基於提示的錨定。我們還提供了一個潛在痕跡視圖，該視圖將兩個基於痕跡的目標連接起來，並解釋了專家痕跡如何在不被用於回合生成的情況下指導優化。在FTRL、BFCL和ToolHop上的實驗顯示，PACT始終優於強大的基於SFT和RL的基線，突顯了特權痕跡共同訓練在多回合工具使用學習中的價值。

Embedded Arena: Iterative Optimization via Hardware Feedback

2606.16190v1 by Zhihan Zhang, Alexander Le Metzger, Jiuyang Lyu, Chun-Cheng Chang, Jiayi Shao, Yujia Liu, Emmanuel Azuh Mensah, Edward Wang, Kurtis Heimerl, Gregory D. Abowd, Shwetak Patel, Natasha Jaques, Vikram Iyer

Embedded devices from wildlife monitoring stations to clinical wearables require local AI inference due to latency, communication, or privacy constraints. Optimizing models for heterogeneous microcontrollers (MCUs) requires simultaneously satisfying hard physical constraints on memory, power, and temperature while preserving accuracy, a multidimensional optimization that is today performed manually by experts. We ask whether an LLM agent can autonomously navigate this complex, multi-turn pipeline guided by real hardware feedback, and introduce a hardware-in-the-loop agent arena in which the agent iteratively refines both model and firmware -- compiling, flashing, and measuring on real hardware -- to enable closed-loop optimization. Frontier models, including Claude Opus 4.7 and Gemini 3.1 Pro, fail entirely without hardware feedback (0% deployment success), whereas our hardware-in-the-loop formulation achieves the first successful deployment within three iterations and can surpass human expert results within seven. This agentic co-optimization achieves 250x compression for vision models with <3.3% accuracy loss and 400x for audio with <6% Feature Error Rate loss, enabling battery-free operation on a commercial MCU via solar harvesting. We demonstrate practical impact in two real-world systems: an elk-detection camera trap (96.7% accuracy) and a phonetic-transcription wearable (8.44% FER) for child development research.

摘要：嵌入式設備從野生動物監測站到臨床可穿戴設備，由於延遲、通信或隱私限制，需要進行本地 AI 推斷。對於異構微控制器 (MCUs) 優化模型需要同時滿足對記憶體、功耗和溫度的嚴格物理限制，同時保持準確性，這是一個多維優化，今天由專家手動執行。我們詢問一個 LLM 代理是否能夠自主導航這個複雜的多輪流程，並受到實際硬體反饋的指導，並介紹一個硬體在迴路的代理競技場，在這裡代理反覆精煉模型和韌體——在實際硬體上編譯、閃存和測量——以實現閉環優化。前沿模型，包括 Claude Opus 4.7 和 Gemini 3.1 Pro，完全依賴硬體反饋失敗（0% 部署成功），而我們的硬體在迴路的公式在三次迭代內實現了第一次成功部署，並且在七次迭代內可以超越人類專家的結果。這種代理協同優化為視覺模型實現了 250 倍壓縮，準確性損失小於 3.3%，音頻則實現了 400 倍壓縮，特徵錯誤率損失小於 6%，使得通過太陽能收集在商業 MCU 上實現無電池操作。我們在兩個現實世界系統中展示了實際影響：一個麋鹿檢測相機陷阱（96.7% 準確率）和一個語音轉錄可穿戴設備（8.44% 特徵錯誤率）用於兒童發展研究。

LiteOdyssey: A Lightweight Reasoning AI Agent for Interpretable Rare-Disease Diagnosis

2606.16149v1 by Minh-Ha Nguyen, Erica Gray, Chih-Ting Yang, Rizwan Hamid, Lingyao Li, Siyuan Ma, Thomas A. Cassini, Cathy Shyr

Most medical AI systems improve by scaling additional machinery: more fine-tuning data, more agents, and/or larger retrieval databases. In rare-disease diagnosis, however, such scaling can produce systems that are difficult to deploy, audit, and maintain. We asked whether state-of-the-art diagnostic performance could instead be achieved by extending the reasoning chain of a single AI agent: guiding it with a diagnostic policy developed through human-AI collaboration and augmenting it with freely available biomedical tools. We introduce LiteOdyssey, a lightweight rare-disease diagnostic framework that guides a reasoning language model through a clinical genetics workflow. This framework was developed through Policy Iteration with Human Feedback (PIHF) and uses dynamic access to public biomedical tools. On two challenging benchmarks that provide only patient clinical features, LiteOdyssey achieved state-of-the-art performance, with an overall disease Recall@1 of 59.3% over the combined 1,243 cases of LIRICAL (n = 370) and the PhenoPacket Store (n = 873). Both benchmarks have a high proportion of ultra-rare diseases (prevalence below 1 in 1,000,000), with ultra-rare shares of approximately 45% and 52.8%, respectively. On the more difficult PhenoPacket subset, where causal diseases were not mapped to Orphanet in our rarity-mapping pipeline, LiteOdyssey achieved 60.7% Recall@1, compared with 10.7% for the same baseline model (GPT-5.4) without tools. This performance was achieved without fine-tuning, multi-agent ensembles, or a large case-retrieval database. Gains were also observed on cases never seen during development, on a private cohort of real-world rare disease patients, and on a smaller open-weights model. LiteOdyssey suggests a path toward rare-disease AI systems that are accurate, easier to deploy, and more transparent for physician review.

摘要：大多數醫療人工智慧系統透過擴展額外的機器來提高性能：更多的微調數據、更多的代理和/或更大的檢索數據庫。然而，在罕見疾病診斷中，這種擴展可能會產生難以部署、審核和維護的系統。我們詢問是否可以通過擴展單個人工智慧代理的推理鏈來實現最先進的診斷性能：通過人類與人工智慧的合作開發診斷政策來引導它，並使用免費的生物醫學工具進行增強。我們介紹了LiteOdyssey，一個輕量級的罕見疾病診斷框架，通過臨床遺傳學工作流程引導推理語言模型。這個框架是通過人類反饋的政策迭代（PIHF）開發的，並使用對公共生物醫學工具的動態訪問。在兩個具有挑戰性的基準上，LiteOdyssey在僅提供患者臨床特徵的情況下實現了最先進的性能，整體疾病的Recall@1為59.3%，涵蓋了1,243個LIRICAL（n = 370）和PhenoPacket Store（n = 873）的案例。這兩個基準中超罕見疾病的比例很高（流行率低於1/1,000,000，超罕見的比例分別約為45%和52.8%）。在更具挑戰性的PhenoPacket子集上，因為因果疾病未在我們的稀有映射管道中映射到Orphanet，LiteOdyssey實現了60.7%的Recall@1，而同一基準模型（GPT-5.4）在未使用工具的情況下僅為10.7%。這一性能是在沒有微調、多代理集成或大型案例檢索數據庫的情況下實現的。還觀察到以下增益：在開發期間從未見過的案例上、在一個真實世界罕見疾病患者的私有隊列上，以及在一個較小的開放權重模型上。LiteOdyssey暗示了朝向準確性高、易於部署且對醫生審查更透明的罕見疾病人工智慧系統的發展道路。

XAI-Grounded Explanation Generation for Speech Deepfake Detection with Training-Free Multimodal Large Language Models

2606.16137v1 by Yupei Li, Qiyang Sun, Xiaoliang Wu, Chenxi Wang, Berrak Sisman, Björn W. Schuller

Speech deepfake detection (SDD) systems require trustworthy explanations for reliable decision-making. Existing explanation approaches mainly fall into two categories. Traditional explainable AI (XAI), such as gradient-based attribution, produces low-level attribution signals that are tightly coupled with model decisions but harder for humans to understand than natural language explanations. Meanwhile, large language model (LLM)-based explanation generation often produces generic and ungrounded descriptions due to the lack of heuristic evidence and task-specific supervision, stemming from limited grounded explanation datasets for SDD. We therefore propose a training-free explanation framework that integrates XAI evidence with multimodal LLMs to generate grounded and specific explanations. Using the PartialSpoof dataset, we construct a grounded explanation dataset and show that methods with XAI increase inside accuracy by over 45%, verified through human evaluation and faithfulness checks.

摘要：語音深偽檢測（SDD）系統需要可信的解釋以進行可靠的決策。現有的解釋方式主要分為兩類。傳統的可解釋人工智慧（XAI），例如基於梯度的歸因，產生與模型決策緊密相關的低階歸因信號，且比自然語言解釋更難以被人理解。與此同時，基於大型語言模型（LLM）的解釋生成常常因缺乏啟發式證據和任務特定的監督而產生通用且缺乏根據的描述，這源於SDD的有限根據解釋數據集。因此，我們提出了一個無需訓練的解釋框架，將XAI證據與多模態LLM整合，以生成有根據且具體的解釋。使用PartialSpoof數據集，我們構建了一個有根據的解釋數據集，並顯示使用XAI的方法使內部準確率提高了超過45\%，這一結果通過人類評估和可信度檢查得到了驗證。

SciText2Eq: Assessing LLMs for Explainable Equation Generation for Scientific Creativity

2606.16003v1 by Yifan Mo, Xiao Fu, Yue Su, Qingyu Meng, Koen Hindriks, Qingzhi Liu, Jiahuan Pei

This work investigates the ability of large language models (LLMs) to generate mathematical equations from scientific texts. Prior work faces challenges in unstructured grounding, multi-equation dependency, and human-aligned evaluation. To this end, we construct a dataset of AI research papers, pairing contextual passages with ground-truth equations and variable descriptions. We develop an explainable equation generation workflow and evaluate it across diverse open- and closed-source LLM backbones. We introduce an evaluation protocol combining automatic metrics, LLM-based rubrics, and human judgments to assess accuracy, explainability, and human-LLM alignment. Results indicate that LLMs perform moderately on lexical- and syntactic-based similarity, while struggling with semantic accuracy. Comparisons between LLM-based evaluations and human judgments reveal limited alignment, highlighting challenges in using LLMs to assess equation quality. These findings offer insights for improving equation generation models and developing more reliable evaluation methods for scientific text. We provide code and data for reproducibility.

摘要：這項研究探討大型語言模型（LLMs）從科學文本生成數學方程式的能力。先前的研究面臨著非結構化基礎、多方程式依賴和人類對齊評估的挑戰。為此，我們構建了一個AI研究論文數據集，將上下文段落與真實方程式和變量描述配對。我們開發了一個可解釋的方程式生成工作流程，並在多種開源和閉源的LLM骨幹上進行評估。我們引入了一個評估協議，結合自動指標、基於LLM的評分標準和人類判斷，以評估準確性、可解釋性和人類-LLM對齊。結果顯示，LLMs在詞彙和句法基礎的相似性上表現中等，但在語義準確性上表現不佳。LLM基礎的評估與人類判斷之間的比較顯示出有限的對齊，突顯了使用LLMs評估方程式質量的挑戰。這些發現為改進方程式生成模型和開發更可靠的科學文本評估方法提供了見解。我們提供了代碼和數據以便重現。

Entity Labels Are Not Entity Signals: A Framework for Observable Relevance in Document Re-Ranking

2606.15998v1 by Utshab Kumar Ghosh, Shubham Chatterjee

Entity-aware document retrieval uses query-associated entities as ranking signals, assuming that semantically relevant entities are also useful retrieval signals. We show this assumption is insufficient and explain why. Unlike terms, which are ground-truth observations, entity links are hypotheses produced by an imperfect linker: an entity can be topically central yet provide no discriminative signal if the linker fires indiscriminately across relevant and non-relevant documents. We formalize this as a distinction between Conceptual Entity Relevance (CER), whether an entity is topically related to a query, and Observable Entity Relevance (OER), whether its observed presence in a collection discriminates relevant from non-relevant documents. Across four collections and annotation sources including human entity judgments, CER and OER exhibit near-chance agreement ($\kappa \approx 0$), while OER operationalizations agree substantially ($\kappa \approx 0.5$), confirming CER as the systematic outlier. CER-based supervision selects topically plausible but weakly discriminative entities, pruning fewer than 4% of non-relevant documents on some collections. Aligning supervision with OER improves non-relevant pruning by up to 10x and open-world MAP by 0.051 over BM25. Our findings motivate a shift from conceptual to observable notions of entity relevance in entity-aware retrieval.

摘要：實體感知文件檢索使用與查詢相關的實體作為排名信號，假設語義上相關的實體也是有用的檢索信號。我們顯示這一假設是不充分的，並解釋原因。與術語不同，術語是基於真實觀察的，而實體鏈接是由不完美的鏈接器產生的假設：如果鏈接器在相關和不相關的文檔中隨意觸發，則一個實體可以在主題上是中心的，但卻不提供任何區分信號。我們將此正式化為概念實體相關性（CER）與可觀察實體相關性（OER）之間的區別——CER是指一個實體是否與查詢在主題上相關，而OER是指其在集合中的觀察存在是否能區分相關和不相關的文檔。在四個集合和標註來源中，包括人類實體判斷，CER和OER的協議接近隨機（$κ\approx 0$），而OER的操作化則顯著一致（$κ\approx 0.5$），確認CER作為系統性異常。基於CER的監督選擇在主題上合理但區分能力較弱的實體，在某些集合中修剪不到4%的不相關文檔。將監督與OER對齊可將不相關文檔的修剪提高至10倍，並使開放世界MAP相較於BM25提高0.051。我們的研究結果促使從概念性實體相關性轉向可觀察的實體相關性在實體感知檢索中的應用。
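
The agreement figures in this abstract are Cohen's kappa values. The sketch below shows how kappa is computed for two labelings of the same entities; the binary CER and OER labels are made up so that observed agreement equals chance agreement.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("labelings must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # degenerate case: both labelings use a single category
    return (observed - expected) / (1 - expected)


# Hypothetical labels for eight entities (1 = relevant under that notion).
cer = [1, 1, 0, 0, 1, 0, 1, 0]   # topically related to the query
oer = [1, 0, 1, 0, 1, 0, 0, 1]   # presence discriminates relevant documents
kappa = cohens_kappa(cer, oer)
print(kappa)
```

In this example the two labelings agree on half the entities, which is exactly what chance predicts for balanced labels, so kappa is zero even though raw agreement is 50%.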

DeepRoot: A KG-Coordinated Multi-Agent System for Therapeutic Reasoning over Historical Medical Texts

2606.15931v1 by Zijian Carl Ma, Sean J. Wang, Sijbren Kramer, Li Erran Li

Historical medical archives and traditional medicines hold immense potential for drug discovery and remain a primary source for current drug development. However, pre-ontological prose and idiosyncratic taxonomies prevent the standardization and medical modernization of the data for use in current biomedical pipelines. Furthermore, no existing LLM agent system, whether tool-calling, retrieval-augmented, or agentic deep-research, can convert such text into verifiable drug-discovery leads at scale. We close this gap with DeepRoot, a multi-agent LLM system that jointly builds and utilizes a verified knowledge graph, showing that grounding and reasoning -- often conflated -- are separable axes the system can compose for therapeutic reasoning. Applied to the Shen Nong Ben Cao Jing, DeepRoot recovers $10$ of $21$ held-out compound-disease treatment pairs at R@$20$ ($47.6\%$ vs $4.8\%$ for a raw corpus LLM and $\sim 2.4\%$ random) and dominates an LLM-as-judge audit for reasoning quality over baseline LLMs and LLMs with direct tool-call access to the same APIs DeepRoot itself queries. Tool-using LLMs hallucinate evidence on $87\%$ of claims, versus $7$-$10\%$ for DeepRoot. Graph-only inference hallucinates $0\%$ but ranks lowest on reasoning coherence; DeepRoot KG+LLM is the only condition to win on both axes, pointing toward a route for systematic mining and repurposing of historical medical knowledge.

摘要：歷史醫療檔案和傳統醫藥在藥物發現方面具有巨大的潛力，並且仍然是當前藥物開發的主要來源。然而，前本體論的散文和特有的分類法阻礙了數據的標準化和醫療現代化，以便用於當前的生物醫學管道。此外，現有的 LLM 代理系統，無論是工具調用、檢索增強還是代理深度研究，都無法將這類文本轉換為可驗證的藥物發現線索。我們通過 DeepRoot 彌補了這一空白，這是一個多代理 LLM 系統，聯合構建和利用經過驗證的知識圖譜，顯示出基礎和推理——通常被混淆——是系統可以組合的可分離軸，用於治療推理。應用於《神農本草經》，DeepRoot 恢復了 $10$ 個 $21$ 個保留的化合物-疾病治療對，在 R@$20$ 下的表現為 $47.6\%$（相比之下，原始語料庫 LLM 為 $4.8\%$，隨機約 $2.4\%$），並在推理質量的 LLM 作為評審的審計中超越了基準 LLM 和直接工具調用訪問相同 API 的 LLM，這些 API 是 DeepRoot 自身查詢的。使用工具的 LLM 在 $87\%$ 的主張上出現幻覺，而 DeepRoot 的比例為 7-10%。僅圖譜推理的幻覺為 $0\%$，但在推理連貫性上排名最低；DeepRoot KG+LLM 是唯一在兩個軸上都獲勝的條件，指向系統性挖掘和重新利用歷史醫療知識的路徑。
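
The R@20 figure is a recall-at-k measurement: the fraction of held-out pairs whose correct item appears among the top k ranked candidates. A minimal sketch follows, with hypothetical compound and disease identifiers; the last line only checks the abstract's own arithmetic.

```python
def recall_at_k(rankings, gold, k):
    """Fraction of queries whose gold item appears in the top-k ranked candidates.

    rankings maps each query to a ranked candidate list; gold maps each query
    to its single correct item.
    """
    hits = sum(1 for q, target in gold.items() if target in rankings.get(q, [])[:k])
    return hits / len(gold)


# Hypothetical rankings of candidate diseases for three held-out compounds.
rankings = {
    "c1": ["d3", "d1", "d2"],
    "c2": ["d2", "d5", "d4"],
    "c3": ["d9", "d8", "d7"],
}
gold = {"c1": "d1", "c2": "d4", "c3": "d6"}

r_at_2 = recall_at_k(rankings, gold, k=2)
r_at_3 = recall_at_k(rankings, gold, k=3)
reported = round(10 / 21, 3)  # 10 of 21 held-out pairs, as stated in the abstract
print(r_at_2, r_at_3, reported)
```

With 21 held-out pairs each hit is worth about 4.8 percentage points, so the $4.8\%$ baseline figure is consistent with a single recovered pair.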

DifFRACT: Diffusion Feature Reconstruction and Attribution for Circuit Tracing

2606.15796v1 by Artyom Mazur, Nina Konovalova, Aibek Alanov

Mechanistic interpretability seeks to explain neural network behavior by decomposing model computations into interpretable features and circuits. While transcoder-based circuit tracing has recently enabled detailed causal analyses of large language models, multimodal diffusion transformers for image generation remain comparatively opaque. We still lack tools for understanding how semantic information propagates across denoising steps and how text and image representations interact within double-stream MM-DiT architectures. Existing methods provide only partial insight: attention maps expose a limited view of token interactions, while sparse autoencoders can discover interpretable features but do not directly reveal how these features are transformed and composed through nonlinear MLP layers. In this work, we extend transcoder-based circuit tracing to multimodal diffusion transformers. We train timestep-conditioned transcoders that faithfully approximate the input-output behavior of MLP sublayers in FLUX.1[schnell]. By replacing MLPs with transcoders and linearizing the remaining computation, we obtain exact feature-to-feature attribution and recover compact, interpretable circuits. Empirically, our transcoders match or slightly outperform sparse autoencoders on the sparsity-faithfulness tradeoff. The resulting circuits reveal mechanisms underlying attribute binding and cross-stream semantic propagation, and provide causal explanations for systematic generation errors. Moreover, circuit-guided interventions are substantially more precise and effective than standard SAE-based steering. Our results demonstrate that transcoder-based circuit analysis is feasible for state-of-the-art diffusion transformers and provides a powerful framework for understanding and controlling multimodal generative models. The code is available at https://github.com/Artalmaz31/DifFRACT

摘要：機械解釋性旨在通過將模型計算分解為可解釋的特徵和電路來解釋神經網絡的行為。雖然基於轉碼器的電路追蹤最近使大型語言模型的詳細因果分析成為可能，但用於圖像生成的多模態擴散Transformer仍然相對不透明。我們仍然缺乏理解語義信息如何在去噪步驟中傳播以及文本和圖像表示如何在雙流MM-DiT架構中互動的工具。現有方法僅提供部分見解：注意力圖揭示了標記互動的有限視圖，而稀疏自編碼器可以發現可解釋的特徵，但不直接顯示這些特徵如何通過非線性MLP層轉換和組合。在這項工作中，我們將基於轉碼器的電路追蹤擴展到多模態擴散Transformer。我們訓練了時間步驟條件的轉碼器，這些轉碼器忠實地近似FLUX中的MLP子層的輸入輸出行為。通過用轉碼器替換MLP並線性化剩餘計算，我們獲得了精確的特徵到特徵的歸因，並恢復了緊湊的可解釋電路。經驗上，我們的轉碼器在稀疏性與忠實性之間的權衡上與稀疏自編碼器相匹配或稍有優於。所得到的電路揭示了屬性綁定和跨流語義傳播的機制，並為系統生成錯誤提供因果解釋。此外，基於電路的干預在精確性和有效性上顯著優於標準的SAE基礎引導。我們的結果表明，基於轉碼器的電路分析對於最先進的擴散Transformer是可行的，並提供了一個強大的框架來理解和控制多模態生成模型。代碼可在 https://github.com/Artalmaz31/DifFRACT 獲得。

Visualizing Uncertainty: Spatial Maps of Missing and Conflicting Evidence in Deep Learning

2606.15767v1 by Dong Hyun Jeong, Feng Chen, Jin-Hee Cho, Lance M. Kaplan, Audun Jøsang, Soo-Yeon Ji

Understanding when and why deep neural networks are uncertain is crucial for deploying reliable machine learning systems in safety-critical domains. While existing uncertainty quantification methods provide scalar measures of model confidence, they offer limited insight into which spatial regions of an input contribute to different types of uncertainty. We propose a novel visualization framework, Uncertainty Activation Map (UAM), that combines Evidential Deep Learning (EDL) with Full-Gradient Class Activation Mapping (FullGrad) to generate interpretable spatial uncertainty activation maps. Our approach distinguishes between two fundamental types of uncertainty: vacuity, representing lack of evidence, and dissonance, capturing conflicting evidence between competing hypotheses. By leveraging the complete gradient decomposition property of FullGrad and the principled uncertainty quantification of Subjective Logic, our method produces theoretically grounded visualizations that highlight specific image regions responsible for model uncertainty. With this framework, vacuity and dissonance activation maps are generated by computing belief-weighted attributions, enabling identification of where models lack knowledge versus where they encounter ambiguous evidence. Extensive evaluations across multiple benchmark datasets demonstrate that the proposed framework effectively addresses the critical gap between uncertainty quantification and explainability, providing intuitive visual feedback to assess model reliability in complex visual recognition tasks.

摘要：理解深度神經網絡何時以及為何存在不確定性，對於在安全關鍵領域部署可靠的機器學習系統至關重要。雖然現有的不確定性量化方法提供了模型信心的標量度量，但對於輸入的哪些空間區域對不同類型的不確定性有貢獻的洞察有限。我們提出了一種新穎的可視化框架，不確定性激活圖（UAM），它將證據深度學習（EDL）與全梯度類別激活映射（FullGrad）結合，以生成可解釋的空間不確定性激活圖。我們的方法區分了兩種基本的不確定性類型：空虛，代表缺乏證據，和不和諧，捕捉競爭假設之間的衝突證據。通過利用FullGrad的完整梯度分解特性和主觀邏輯的原則性不確定性量化，我們的方法產生了理論上有根據的可視化，突顯出負責模型不確定性的特定圖像區域。通過這個框架，空虛和不和諧激活圖是通過計算信念加權的歸因來生成的，從而能夠識別模型缺乏知識的地方以及遇到模糊證據的地方。在多個基準數據集上的廣泛評估表明，所提出的框架有效地解決了不確定性量化與可解釋性之間的關鍵差距，提供了直觀的視覺反饋，以評估模型在複雜視覺識別任務中的可靠性。
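
The two uncertainty types named in this abstract can be illustrated with a small worked example. The sketch below computes vacuity and dissonance for a single evidence vector using the Subjective Logic formulas commonly paired with Evidential Deep Learning. It assumes a Dirichlet prior weight equal to the number of classes, and the function name is hypothetical. It does not reproduce the paper's FullGrad-based spatial attribution.

```python
def vacuity_and_dissonance(evidence):
    """Return (beliefs, vacuity, dissonance) for one non-negative evidence vector."""
    k = len(evidence)
    strength = sum(evidence) + k            # Dirichlet strength S = sum(e_k + 1)
    beliefs = [e / strength for e in evidence]
    vacuity = k / strength                  # mass left unassigned: lack of evidence

    def balance(b_j, b_k):
        # Relative balance of two belief masses: 1 when equal, 0 when one is zero.
        total = b_j + b_k
        return 1.0 - abs(b_j - b_k) / total if total > 0 else 0.0

    dissonance = 0.0
    for i, b_i in enumerate(beliefs):
        others = [b for j, b in enumerate(beliefs) if j != i]
        denom = sum(others)
        if denom > 0:
            dissonance += b_i * sum(b * balance(b, b_i) for b in others) / denom
    return beliefs, vacuity, dissonance


if __name__ == "__main__":
    for ev in ([0, 0, 0], [20, 0, 0], [10, 10, 0]):
        _, u, d = vacuity_and_dissonance(ev)
        print(ev, round(u, 3), round(d, 3))
```

With no evidence the vacuity is 1. Strong evidence for one class drives both quantities down, and equal evidence for two competing classes keeps vacuity low while dissonance is high.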

InstantForget: Update-Free Backdoor Unlearning with Inference-Time Feature Reset

2606.15730v1 by Zhenyu Yu

Backdoor unlearning aims to remove a malicious trigger behavior from a deployed model while preserving clean utility. We study the update-free inference-time setting, where model parameters remain frozen. First, we audit a common projection assumption under oracle paired clean and triggered features. Projection succeeds mainly on BadNets and leaves WaNet, Blended, and SIG at 0.683, 0.888, and 0.941 ASR on CIFAR-10 ResNet-18. This failure is not explained by spectral compactness, spatial locality, or subspace misalignment. It is predicted by a logit-triplet gap involving the target margin, target-logit drop, and non-target logit rise. We then introduce InstantForget, a clean-calibrated gated reset that flags anomalous features with a Mahalanobis score and moves only flagged features toward a neutral non-target representation. With one fixed operating point selected on held-out triggered validation, InstantForget reduces average ASR to 0.071 across four non-adaptive CIFAR-10 triggers without triggered samples or parameter updates at deployment. It also reaches 0.981 detection AUROC and transfers to six of eight tested backbones. Reported failures under WaNet, ModelNet10 point blend, two backbone geometries, and adaptive feature-compactness attacks define the method's scope.

摘要：後門去學習旨在從已部署的模型中移除惡意觸發行為，同時保持清潔效用。我們研究了無更新推斷時間設置，其中模型參數保持不變。首先，我們審核了在預言者配對清潔和觸發特徵下的一個常見投影假設。投影主要在 BadNets 上成功，並在 CIFAR-10 ResNet-18 上留下 WaNet、Blended 和 SIG 的 ASR 分別為 0.683、0.888 和 0.941。這一失敗無法通過光譜緊湊性、空間局部性或子空間不對齊來解釋。它是由涉及目標邊際、目標邏輯下降和非目標邏輯上升的邏輯三元組差距預測的。我們隨後介紹了 InstantForget，一種清潔校準的門控重置，它通過馬哈拉諾比斯分數標記異常特徵，並僅將標記的特徵移向中性非目標表示。在保留的觸發驗證中選擇一個固定的操作點，InstantForget 在四個非自適應的 CIFAR-10 觸發器中將平均 ASR 降低至 0.071，而不需要觸發樣本或在部署時更新參數。它還達到了 0.981 的檢測 AUROC，並轉移到八個測試骨幹中的六個。報告中在 WaNet、ModelNet10 點混合、兩個骨幹幾何和自適應特徵緊湊性攻擊下的失敗定義了該方法的範疇。
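
The gated reset described above can be sketched as follows, under stated assumptions. Clean-feature statistics are fit once, each incoming feature vector receives a Mahalanobis score, and only flagged vectors are moved toward a neutral representation. The function names, the ridge term, the threshold, and the choice of neutral vector are illustrative assumptions. The abstract does not specify how InstantForget calibrates them.

```python
import numpy as np


def fit_clean_stats(clean_feats, ridge=1e-3):
    """Mean and precision matrix of clean features (ridge keeps the inverse stable)."""
    mu = clean_feats.mean(axis=0)
    cov = np.cov(clean_feats, rowvar=False) + ridge * np.eye(clean_feats.shape[1])
    return mu, np.linalg.inv(cov)


def mahalanobis(feats, mu, prec):
    """Mahalanobis distance of each row of feats from the clean distribution."""
    d = feats - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", d, prec, d))


def gated_reset(feats, mu, prec, neutral, threshold, strength=1.0):
    """Move only anomalous rows toward `neutral`; model parameters are never touched."""
    flagged = mahalanobis(feats, mu, prec) > threshold
    out = feats.copy()
    out[flagged] = (1.0 - strength) * feats[flagged] + strength * neutral
    return out, flagged


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.normal(size=(500, 4))
    mu, prec = fit_clean_stats(clean)
    feats = np.vstack([mu, mu + 50.0])      # one typical row, one far outlier
    out, flagged = gated_reset(feats, mu, prec, np.zeros(4), threshold=5.0)
    print(flagged)
```

Unflagged rows pass through unchanged, which is why a gate of this kind can preserve clean utility when the detector is well calibrated.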

AI-Driven Framework for Adaptive Water Network Management with Proof-of-Concept Implementation: Addressing Non-Revenue Water in Jordan

2606.15709v1 by Mohammed Fasha, Nahel Al-Maayta, Bilal Sowan, Mohammad Athamneh, Husam Barham

Jordan faces severe water scarcity, with 50% of the water produced lost to leakage, theft, and metering issues, collectively known as non-revenue water (NRW). Traditional reactive approaches have proven insufficient for sustained NRW reduction. This paper proposes an intelligent framework integrating EPANET hydraulic modeling, digital twin technology, SCADA systems, and large language model (LLM)-based AI agents for continuous network monitoring and adaptive decision-making. The system combines real-time data streams with physics-based simulation to detect anomalies, employing retrieval-augmented generation (RAG) for policy interpretation and function calling for network control. A proof-of-concept implementation validates technical feasibility using EPYT with offline LLMs (llama3.1:8b via Ollama) on a 1,164-junction Amman district network. The system demonstrates automated hydraulic simulation, flow-based anomaly detection aligned with water distribution zone (DZ) practice, and AI-generated health reports with response times under 2 minutes and zero API costs. Burst detection relies on local flow anomaly analysis: a 30.1 L/s simulated leak produces measurable flow redistribution in 15 pipes, flagging a 15-junction cluster that localises the burst -- confirming alignment with DZ monitoring practice. The framework accommodates Jordan's intermittent supply patterns and limited automation through phased implementation, offering a scalable pathway for water-scarce regions to leverage intelligent automation for NRW reduction and operational efficiency.

摘要：約旦面臨嚴重的水資源短缺，50\% 的水產量因漏水、盜竊和計量問題而損失，這也被稱為非收入水 (NRW)。傳統的反應性方法已被證明不足以持續減少 NRW。本文提出了一個智能框架，整合了 EPANET 水力模型、數位雙胞胎技術、SCADA 系統和基於大型語言模型 (LLM) 的 AI 代理，用於持續的網絡監控和自適應決策。該系統結合了實時數據流和基於物理的模擬來檢測異常，採用檢索增強生成 (RAG) 進行政策解釋，並通過函數調用進行網絡控制。概念驗證實施使用 EPYT 和離線 LLM（llama3.1:8b 通過 Ollama）在 1,164 個接頭的安曼區網絡上驗證了技術可行性。該系統展示了自動化的水力模擬、基於流量的異常檢測，與水分配區 (DZ) 實踐相一致，並生成 AI 健康報告，響應時間在 2 分鐘以內且無 API 成本。爆裂檢測依賴於本地流量異常分析：一個 30.1~L/s 的模擬漏水在 15 根管道中產生可測量的流量重分配，標記出一個 15 接頭的集群，定位了爆裂點——確認與水分配區 (DZ) 監控實踐的一致性。該框架通過分階段實施，適應約旦的間歇性供應模式和有限的自動化，為水資源短缺的地區提供了利用智能自動化減少 NRW 和提高運營效率的可擴展途徑。
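
As a rough illustration of flow-based anomaly detection, the sketch below compares simulated baseline pipe flows with observed flows and flags pipes whose relative change exceeds a threshold. The pipe IDs, flow values, thresholds, and function name are hypothetical. No EPANET or EPYT calls are used, and the paper's actual detection logic may differ.

```python
def flag_flow_anomalies(baseline, observed, rel_threshold=0.10, min_flow=0.5):
    """Return pipe IDs whose flow (L/s) deviates from baseline by more than rel_threshold.

    min_flow stops near-zero baseline flows from producing huge relative changes.
    """
    flagged = []
    for pipe, base in baseline.items():
        change = abs(observed[pipe] - base)
        scale = max(abs(base), min_flow)
        if change / scale > rel_threshold:
            flagged.append(pipe)
    return sorted(flagged)


if __name__ == "__main__":
    baseline = {"P1": 10.0, "P2": 20.0, "P3": 0.1}   # hypothetical simulated flows
    observed = {"P1": 10.2, "P2": 26.0, "P3": 0.12}  # hypothetical sensor readings
    print(flag_flow_anomalies(baseline, observed))
```

Mapping the flagged pipes to their end junctions would give the kind of junction cluster the abstract uses to localise a burst.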

Quantum Cinema: An Interactive Cinematic Exploration of Quantum Computing Hardware via Generative World Models

2606.17102v1 by Aoyu Zhang, Dongping Liu, Luyao Zhang

Quantum computing promises transformative advances across science and industry, yet the physical hardware that enables these computations remains invisible to the public: quantum processors operate inside sealed dilution refrigerators at temperatures near absolute zero, making direct observation impossible. This "imagination gap" between quantum computing's growing societal impact and the public's ability to visualize it represents a significant barrier to quantum literacy and workforce development. We present Quantum Cinema, an open-source, browser-based interactive application that closes this gap by transforming invisible quantum hardware into explorable, cinematic experiences using generative world models. Quantum Cinema guides users through a four-act narrative -- from the foundational Nobel Prize-winning science of quantum entanglement, through curated video introductions to three major quantum computing architectures (trapped-ion, neutral-atom, and superconducting systems), into immersive three-dimensional generative worlds that make invisible quantum phenomena observable, and finally to interactive radar-chart comparisons grounded in real quantum device specifications. All three-dimensional environments are generated using WorldLabs' generative world model platform and are scientifically grounded in curated metrics from Amazon Web Services (AWS) Braket quantum hardware. Quantum Cinema requires no installation, no specialized hardware, and no quantum computing background. It is designed to serve two distinct communities: scholars and developers seeking to replicate or extend the platform, and educators, researchers, and science communicators seeking an intuitive tool for explaining quantum hardware to diverse audiences. This paper describes the system architecture, the generative world model pipeline, use cases for both communities, and directions for future work.

摘要：量子計算承諾在科學和工業領域帶來變革性的進展，然而，支持這些計算的物理硬體對公眾來說仍然是隱形的：量子處理器在接近絕對零度的密封稀釋冰箱內運行，使得直接觀察變得不可能。量子計算日益增長的社會影響與公眾可視化能力之間的這種「想像差距」代表了量子素養和勞動力發展的一個重要障礙。我們提出了量子影院（Quantum Cinema），這是一個開源的基於瀏覽器的互動應用程序，通過使用生成性世界模型將隱形的量子硬體轉化為可探索的電影體驗，來縮小這一差距。量子影院引導用戶通過四幕敘事——從量子糾纏的基礎諾貝爾獎科學，通過對三種主要量子計算架構（捕獲離子、中性原子和超導系統）的精選視頻介紹，進入使隱形量子現象可觀察的沉浸式三維生成世界，最後到基於真實量子設備規格的互動雷達圖比較。所有三維環境均使用WorldLabs的生成性世界模型平台生成，並基於來自亞馬遜網絡服務（AWS）Braket量子硬體的精選指標進行科學驗證。量子影院不需要安裝、不需要專業硬體，也不需要量子計算背景。它旨在服務兩個不同的社群：尋求複製或擴展該平台的學者和開發者，以及尋求直觀工具以向不同受眾解釋量子硬體的教育工作者、研究人員和科學傳播者。本文描述了系統架構、生成性世界模型管道、兩個社群的使用案例以及未來工作的方向。

Is Code Better Than Language for Algorithmic Reasoning

2606.15589v1 by Terry Tong, Yu Feng, Surbhi Goel, Dan Roth

For tool-augmented language models, comparing natural-language reasoning with code-execution pipelines is difficult because the comparison changes both the intermediate representation and the execution mechanism. We separate these factors with an intermediate intervention: the model expresses its reasoning as executable code, and the language model simulates that code in context to produce an answer. On a 40-task verifiable algorithmic benchmark, deterministic code execution outperforms natural-language reasoning by +31.6pp. We observe that the intermediate intervention is not meaningfully different from natural-language reasoning (+0.15pp). These results suggest that, in our evaluated setting, changing the intermediate representation alone does not explain the tool-use advantage, which is evidence that the performance gains require reliable external execution. We formalize this intuition with a simple statistical decision-theoretic model that characterizes when execution dominates end-to-end risk in our disentangled trace-generation/execution regime. We validate our theory using a reconstruction intervention that leverages a proxy language model to infer natural-language reasoning traces from code representations, recovering performance comparable to the original natural-language reasoning pipeline. All experiments are at https://github.com/TerryTong-Git/ToolProj.

摘要：對於工具增強的語言模型，將自然語言推理與代碼執行管道進行比較是困難的，因為這種比較會改變中間表示和執行機制。我們通過一個中間介入來分離這些因素：模型將其推理表達為可執行代碼，語言模型在上下文中模擬該代碼以產生答案。在一個包含40個任務的可驗證算法基準上，確定性代碼執行的表現超過自然語言推理31.6個百分點。我們觀察到，中間介入與自然語言推理並沒有實質性區別（+0.15個百分點）。這些結果表明，在我們評估的設置中，僅改變中間表示並不能解釋工具使用的優勢，提供了性能增益需要可靠外部執行的證據。我們用一個簡單的統計決策理論模型來形式化這一直覺，該模型描述了在我們的解耦追蹤生成/執行體系中，何時執行主導端到端風險。我們使用重建介入來驗證我們的理論，該介入利用代理語言模型從代碼表示推斷自然語言推理的追蹤，恢復了與原始自然語言推理管道相當的性能。所有實驗均在 https://github.com/TerryTong-Git/ToolProj 上進行。

Service-Induced Congestion in Memory-Constrained LLM Serving

2606.15555v1 by Ruicheng Ao, Jing Dong, Gan Luo, David Simchi-Levi

In large language model (LLM) serving, each request accumulates persistent graphics processing unit (GPU) memory during service as its key-value cache grows with every generated token. Under high concurrency, aggregate memory usage therefore increases endogenously over time: the service process itself creates future capacity pressure. When memory capacity is exceeded, systems evict active requests, discarding cached state and restarting them later, which wastes computation and reduces throughput. We develop a discrete-time dynamical model of memory-constrained LLM inference that captures admission, memory growth, and eviction under continuous batching. In the saturated-input regime, the system admits both eviction-free fixed points and limit cycles with evictions. For homogeneous workloads, we show that the eviction-free equilibrium is unstable and that, except for a Lebesgue-measure-zero exact-capture set, the system converges to a unique worst-case limit cycle that is asymptotically stable outside this exceptional set, with throughput losses as large as 50%. For heterogeneous workloads, we prove a stability criterion in the two-class common-input setting and explain how the survival-polynomial mechanism generalizes to multiple classes and heterogeneous-input lengths. Under an input-dominated scaling regime, coprime decoding lengths stabilize the eviction-free equilibrium, while non-coprime lengths create synchronized modes that drive instability. These results characterize when workload heterogeneity desynchronizes completions and helps stabilize memory-constrained serving. More broadly, we identify service-induced congestion as a structural instability mechanism and derive scheduling design principles for sustaining high throughput.

摘要：在大型語言模型（LLM）服務中，每個請求在服務過程中隨著每個生成的標記而累積持久的圖形處理單元（GPU）記憶體，以增長其鍵值快取。在高併發的情況下，總記憶體使用量因此隨時間內生性地增加：服務過程本身創造了未來的容量壓力。當記憶體容量被超過時，系統會驅逐活躍請求，丟棄快取狀態並在稍後重新啟動它們，這會浪費計算資源並降低吞吐量。我們開發了一個記憶體受限的 LLM 推理的離散時間動態模型，捕捉在持續批處理下的接納、記憶體增長和驅逐。在飽和輸入範疇中，系統同時接納無驅逐的固定點和帶有驅逐的極限週期。對於同質工作負載，我們顯示無驅逐的平衡是不穩定的，並且除了勒貝格測度為零的精確捕獲集外，系統收斂到一個唯一的最壞情況極限週期，該週期在這個例外集之外是漸近穩定的，吞吐量損失高達 50%。對於異質工作負載，我們在兩類共同輸入設置中證明了一個穩定性準則，並解釋生存多項式機制如何推廣到多個類別和異質輸入長度。在輸入主導的縮放範疇下，互質的解碼長度穩定了無驅逐的平衡，而非互質的長度則創造了驅動不穩定性的同步模式。這些結果表徵了工作負載異質性何時使完成不同步並幫助穩定記憶體受限的服務。更廣泛地說，我們將服務引起的擁堵識別為一種結構性不穩定機制，並推導出維持高吞吐量的調度設計原則。
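
The feedback loop described above (admission, per-token memory growth, eviction) can be explored with a toy discrete-time simulation. This is a simplified sketch and not the paper's model. It assumes homogeneous requests, a saturated input queue, one decoded token per step, and a youngest-first eviction rule. All names and parameters are hypothetical.

```python
def simulate(capacity, input_len, decode_len, steps):
    """Toy memory-constrained serving loop; returns (completed, evictions).

    Each active request holds input_len + generated tokens of KV-cache memory.
    """
    active = []                      # tokens generated so far, one entry per request
    completed = evictions = 0
    for _ in range(steps):
        # Admission: saturated input, so admit while another prompt still fits.
        used = sum(input_len + g for g in active)
        while used + input_len <= capacity:
            active.append(0)
            used += input_len
        # Service: every active request decodes one token and its cache grows.
        active = [g + 1 for g in active]
        # Completion: finished requests release their memory.
        completed += sum(1 for g in active if g >= decode_len)
        active = [g for g in active if g < decode_len]
        # Eviction: discard the youngest requests until memory fits again.
        active.sort(reverse=True)
        while active and sum(input_len + g for g in active) > capacity:
            active.pop()
            evictions += 1
    return completed, evictions


if __name__ == "__main__":
    print(simulate(capacity=4, input_len=2, decode_len=1, steps=5))
    print(simulate(capacity=4, input_len=2, decode_len=3, steps=6))
```

In the second call, admitting as many requests as fit at arrival time forces an eviction once their caches grow, so admitted work is wasted. This is the service-induced pressure the abstract describes, shown here only in miniature.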

Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering

2606.15419v1 by Zaifu Zhan, Shuang Zhou, Rui Zhang

Objective: To enhance the accuracy, interpretability, and robustness of large language models (LLMs) in medical question answering (MedQA). Method: We designed a multi-agent peer-reviewed reasoning method in which multiple LLM agents independently generate chain-of-thought reasoning with candidate answers, then act as peer reviewers to evaluate each other's reasoning for factual correctness and logical soundness. The highest-rated reasoning chain is selected to produce the final answer. Experiments were conducted with five state-of-the-art LLMs (Llama-3.1-8B, Qwen2.5-7B, Phi-4, DeepSeek-LLM-7B, GPT-oss-20B) on three benchmark datasets: HeadQA, MedQA-USMLE, and PubMedQA. Performance was compared against single-model chain-of-thought reasoning and chain-of-thought-based majority voting. Results: Peer-reviewed reasoning consistently outperformed both baselines. The best model combination achieved an average accuracy of 0.820 across datasets, exceeding the strongest single model (0.777) and majority voting ensembles (up to 0.789). The method also scaled effectively with more participating models, while peer assessments reliably distinguished high- from low-quality reasoning chains. Conclusion: The proposed multi-agent peer-reviewed reasoning method enables LLMs to act as both solvers and evaluators, yielding superior performance in MedQA. By emphasizing reasoning quality rather than answer agreement alone, this approach improves accuracy, interpretability, and robustness, offering a promising direction for trustworthy biomedical AI systems.

摘要：目標：提升大型語言模型（LLMs）在醫學問題回答（MedQA）中的準確性、可解釋性和穩健性。
方法：我們設計了一種多代理同行評審推理方法，其中多個LLM代理獨立生成思考鏈推理及候選答案，然後作為同行評審來評估彼此的推理在事實正確性和邏輯合理性方面的表現。評分最高的推理鏈被選中以產生最終答案。實驗使用了五個最先進的LLM（Llama-3.1-8B、Qwen2.5-7B、Phi-4、DeepSeek-LLM-7B、GPT-oss-20B）在三個基準數據集上進行：HeadQA、MedQA-USMLE和PubMedQA。性能與單模型思考鏈推理和基於思考鏈的多數投票進行比較。
結果：同行評審推理始終超越了兩個基準。最佳模型組合在數據集上達到了平均準確率0.820，超過了最強的單一模型（0.777）和多數投票集成（最高達到0.789）。該方法在參與模型數量增加時也能有效擴展，而同行評估可靠地區分了高質量和低質量的推理鏈。
結論：所提出的多代理同行評審推理方法使LLMs能夠同時作為解決者和評估者，在MedQA中產生了卓越的表現。通過強調推理質量而非僅僅是答案一致性，這種方法提高了準確性、可解釋性和穩健性，為可信的生物醫學AI系統提供了有前景的方向。
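
The selection step in the method above can be sketched as follows. Each agent's reasoning chain is scored by the other agents, and the chain with the highest mean peer rating supplies the final answer. Excluding self-ratings, averaging the scores, and the data layout are assumptions made for illustration. The abstract does not state the exact aggregation rule.

```python
def select_by_peer_review(candidates, ratings):
    """Pick the answer whose reasoning chain has the highest mean peer rating.

    candidates: {agent: (reasoning_text, answer)}
    ratings:    {(reviewer, author): score}; self-ratings are ignored.
    """
    mean_scores = {}
    for author in candidates:
        peer = [s for (reviewer, a), s in ratings.items()
                if a == author and reviewer != author]
        mean_scores[author] = sum(peer) / len(peer) if peer else float("-inf")
    best = max(mean_scores, key=mean_scores.get)
    return best, candidates[best][1], mean_scores


if __name__ == "__main__":
    candidates = {                       # hypothetical agents and outputs
        "A": ("chain A", "option 2"),
        "B": ("chain B", "option 3"),
        "C": ("chain C", "option 3"),
    }
    ratings = {("B", "A"): 4, ("C", "A"): 5, ("A", "B"): 2,
               ("C", "B"): 3, ("A", "C"): 3, ("B", "C"): 3, ("A", "A"): 5}
    print(select_by_peer_review(candidates, ratings))
```

In this made-up example majority voting would return "option 3", while peer review selects "option 2" because its reasoning chain is rated highest. That is the distinction the abstract draws between reasoning quality and answer agreement.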

APEX: Adaptive Principle EXtraction A Three-Layer Self-Evolution Framework for Production AI Agents

2606.15363v1 by Ya-Chuan Chen, Tien-Jen Lai, Hsiang-Wei Hu

Self-improvement in AI agents has emerged as a key research frontier: systems that modify their own prompts, workflows, and decision rules based on accumulated operational experience. The state-of-the-art Self-Harness framework [1] achieves 14--21% improvement on Terminal-Bench-2.0 by mining failure clusters and patching the agent harness. However, Self-Harness optimises only one dimension -- the prompt harness -- leaving behavioural principles and workflow topology unchanged. We propose APEX (Adaptive Principle EXtraction), a three-layer co-evolution framework that simultaneously evolves: (L1) the harness via failure-mode patching, (L2) behavioural principles via success-trace distillation [2], and (L3) the agent workflow topology via structural fitness-based selection [6]. We implement APEX on Joe [13], a production-grade super AI Agent built on NVIDIA Nemotron and designed as an Edge AI Agent Factory for the NVIDIA Agent Challenge 2026, managing a 15-node compute fleet using 114 real task traces collected over 18 days. APEX achieves an APEX Health Score of 0.570 (+90% vs. baseline 0.300) in a single evolutionary run, distilling 6 novel reusable principles and selecting a research-first workflow topology scoring 0.900 (+20%). Our results demonstrate that multi-dimensional co-evolution substantially outperforms single-axis harness optimisation, at a cost of only 4 LLM calls (~270 s) on a local qwen2.5-coder:32b instance.

摘要：自我改進的 AI 代理已成為一個關鍵的研究前沿：這些系統根據累積的操作經驗修改自己的提示、工作流程和決策規則。最先進的 Self-Harness 框架 [1] 通過挖掘失敗集群和修補代理鞍具，在 Terminal-Bench-2.0 上實現了 14--21% 的改進。然而，Self-Harness 只優化了一個維度——提示鞍具——而行為原則和工作流程拓撲保持不變。我們提出了 APEX（自適應原則提取），這是一個三層共演化框架，同時演化： (L1) 通過失敗模式修補來改進鞍具，(L2) 通過成功追蹤蒸餾行為原則 [2]，以及 (L3) 通過基於結構適應度的選擇來改變代理工作流程拓撲 [6]。我們在 Joe [13] 上實現了 APEX，這是一個基於 NVIDIA Nemotron 構建的生產級超 AI 代理，設計為 NVIDIA 代理挑戰 2026 的邊緣 AI 代理工廠，管理一個使用 114 個在 18 天內收集的真實任務追蹤的 15 節點計算集群。在單次進化運行中，APEX 實現了 0.570 的 APEX 健康分數（比基線 0.300 提高 90%），蒸餾出 6 個新穎可重用的原則，並選擇了一個研究優先的工作流程拓撲，得分 0.900（提高 20%）。我們的結果表明，多維共演化顯著優於單軸鞍具優化，僅在本地 qwen2.5-coder:32b 實例上消耗 4 次 LLM 調用（約 270 秒）的成本。

Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes

2606.15307v1 by Mohamed Bayan Kmainasi, Mucahid Kutlu, Ali Ezzat Shahroor, Abul Hasnat, Firoj Alam

Hateful and propagandistic memes exploit the interplay between images and text to convey harmful intent that neither modality reveals alone. Although thinking-based multimodal large language models (MLLMs) have advanced vision-language understanding, their application to meme content moderation remains underexplored. We propose a reinforcement learning-based post-training method that improves classification performance and reference-based explanation quality in thinking-based MLLMs via task-specific rewards and Group Relative Policy Optimization (GRPO). Concretely, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful and propagandistic meme understanding across English and Arabic benchmarks, (ii) extend existing meme datasets with weakly supervised chain-of-thought (CoT) rationales via distillation and multi-LLM fine-grained propaganda annotations, (iii) introduce a GRPO-based objective with thinking-length regularization that jointly optimizes classification accuracy and explanation quality, and (iv) investigate self-supervised GRPO on unlabeled memes using consensus-based pseudo-labels. Experiments on the Hateful Memes and ArMeme benchmarks show that our approach improves over previously reported results on FHM accuracy (up to +2.1%, from 79.9% to 82.0%) and on ArMeme macro-F1 (up to +7.6 points, from 0.536 to 0.612 with explanations; +6.1 compared to the original ArMeme benchmark), while also generating natural-language explanations. On ArMeme, sequence-classification baselines remain stronger in terms of raw accuracy, whereas our approach provides more balanced per-class performance along with explanations. We publicly release our code, data extensions, and evaluation resources.

摘要：仇恨和宣傳性迷因利用圖像與文本之間的相互作用來傳達單一模式無法單獨揭示的有害意圖。儘管基於思考的多模態大型語言模型（MLLMs）在視覺-語言理解方面取得了進展，但其在迷因內容審核中的應用仍然未被充分探討。我們提出了一種基於強化學習的後訓練方法，通過任務特定的獎勵和群體相對策略優化（GRPO）來提高基於思考的MLLMs的分類性能和基於參考的解釋質量。具體來說，我們（i）對現成的MLLMs在英語和阿拉伯語基準上進行系統的實證研究，以理解仇恨和宣傳性迷因，（ii）通過蒸餾和多MLM的細粒度宣傳註釋，擴展現有的迷因數據集，並提供弱監督的思考鏈（CoT）推理，（iii）引入一種基於GRPO的目標，並進行思考長度正則化，該目標共同優化分類準確性和解釋質量，以及（iv）利用基於共識的偽標籤研究在未標記迷因上的自我監督GRPO。在仇恨迷因和ArMeme基準上的實驗顯示，我們的方法在FHM準確性（提高至+2.1%，從79.9%提升至82.0%）和ArMeme宏F1（提高至+7.6點，從0.536提升至0.612，附帶解釋；與原始ArMeme基準相比提高6.1）上均優於先前報告的結果，同時生成自然語言解釋。在ArMeme上，序列分類基準在原始準確性方面仍然更強，而我們的方法則提供了更平衡的每類性能以及解釋。我們公開發布了我們的代碼、數據擴展和評估資源。
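
The core of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt. The sketch below shows that group-relative normalisation together with a hypothetical shaped reward that combines correctness, an explanation-quality score, and a penalty on thinking length. The reward terms, weights, and length budget are illustrative assumptions and not the paper's actual objective.

```python
import math


def shaped_reward(correct, explanation_score, n_think_tokens,
                  max_think=256, penalty=0.001):
    """Hypothetical reward: label correctness + explanation quality - length overflow."""
    overflow = max(0, n_think_tokens - max_think)
    return float(correct) + explanation_score - penalty * overflow


def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of samples for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    group = [
        shaped_reward(True, 0.6, 200),
        shaped_reward(True, 0.4, 600),   # correct, but thinks far past the budget
        shaped_reward(False, 0.5, 150),
        shaped_reward(False, 0.1, 100),
    ]
    print(group_relative_advantages(group))
```

Because advantages are centred within the group, no separate value network is needed to provide a baseline.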

Enabling Real-Time Point-of-Care Ultrasound Segmentation: A GPU-Free Deployment in Resource-Limited Settings

2606.15176v1 by Weihao Gao

Ultrasound imaging is the most widely adopted medical modality globally due to its low cost and portability, yet artificial intelligence (AI) deployment remains constrained by reliance on GPU-accelerated models, creating a structural paradox where the cost of "intelligence" exceeds that of the imaging device itself. Here, we present the systematic adaptation and extensive evaluation of UltraSeg, an ultra-lightweight architecture originally developed for colonoscopic polyp segmentation, now engineered for point-of-care ultrasound (POCUS) across ten public datasets spanning six anatomical sites (breast, thyroid, kidney, carotid, fetal, and small-animal tumor). We systematically validate both variants in ultrasound domains: UltraSeg-130K (0.13M parameters) achieves 89.7 FPS on single-core CPUs and 34.8 FPS on a refurbished mobile device, while UltraSeg-500K (0.5M parameters) delivers 44.6 FPS on CPU and 16.1 FPS on mobile device. UltraSeg-500K matches or exceeds the Dice performance of the 31M-parameter UNet and approaches 105M-parameter TransUNet in average performance, with superior zero-shot cross-dataset generalization on external validation sets (UDIAT, DDTI). By enabling clinical-grade segmentation without GPU dependency, this work brings AI costs in line with ultrasound accessibility, making advanced diagnostics available in resource-limited settings.

摘要：超聲影像因其低成本和可攜性而成為全球最廣泛採用的醫療模式，但人工智慧（AI）的應用仍受限於對GPU加速模型的依賴，形成了一種結構性悖論，即「智慧」的成本超過影像設備本身的成本。在此，我們展示了UltraSeg的系統性調整和廣泛評估，這是一種最初為結腸鏡息肉分割而開發的超輕量架構，現已針對十個公共數據集進行了針對即時超聲（POCUS）的工程設計，涵蓋六個解剖部位（乳腺、甲狀腺、腎臟、頸動脈、胎兒和小動物腫瘤）。我們系統性地驗證了超聲領域中的兩個變體：UltraSeg-130K（0.13M參數）在單核心CPU上達到89.7 FPS，在翻新移動設備上達到34.8 FPS，而UltraSeg-500K（0.5M參數）在CPU上提供44.6 FPS，在移動設備上提供16.1 FPS。UltraSeg-500K的Dice性能與31M參數的UNet相匹配或超過，並在平均性能上接近105M參數的TransUNet，並在外部驗證集（UDIAT，DDTI）上展現出優越的零樣本跨數據集泛化能力。通過實現無GPU依賴的臨床級分割，這項工作使AI成本與超聲可及性相符，使高級診斷在資源有限的環境中變得可用。

CottonLeafVision: An Explainable and Robust Deep Learning Framework for Cotton Leaf Disease Classification

2606.14686v1 by Rafi Ahamed, Md. Abir Rahman, Tasnia Tarannum Roza, Munaia Jannat Easha, Md. Asif Khan, Sudeepta Mandal

Globally, cotton is an economically important crop on which the textile industry depends heavily, so precise identification and detection of cotton leaf disease is crucial for economic stability. The goal of "CottonLeafVision" is to accurately classify and detect cotton leaf disease. To this end, we evaluated multiple pretrained deep convolutional neural networks, including DenseNet201, InceptionV3, and VGG19, on a publicly available cotton leaf disease image dataset. The dataset comprises seven classes (six disease classes and one healthy class) collected under various field conditions that reflect real-world challenges. Among the pretrained models, DenseNet201 achieved the highest classification accuracy, 98%. To enhance reliability and interpretability, we applied Gradient-weighted Class Activation Mapping (Grad-CAM) and occlusion sensitivity analysis, and used adversarial training to increase the model's noise resistance. Finally, we developed a prototype to apply the model in real-life agriculture. This paper demonstrates the deep learning model's ability to classify disease in real-life cotton disease management situations.

摘要：全球而言，棉花是一種經濟效益極高的作物，因為紡織業在很大程度上依賴於它。因此，準確識別和檢測棉葉病對於經濟穩定至關重要。“CottonLeafVision”的發展目標是準確分類和檢測棉葉病。為了實現這一目標，我們評估了多個預訓練的深度卷積神經網絡，包括DenseNet201、InceptionV3和VGG19，這些網絡在一個公開可用的棉葉病影像數據集上進行了測試。這個影像數據集包括七個類別，其中六個是病害類別，一個是健康類別，這些數據是在各種田間條件下收集的，反映了現實世界的挑戰。在這些預訓練模型中，使用DenseNet201時，我們達到了98%的最高分類準確率。為了增強模型的可靠性和可解釋性，我們實施了不同的技術和方法，如梯度加權類別激活映射（Grad-CAM）、遮蔽敏感性分析和對抗訓練，以提高模型的抗噪聲能力。最後，我們開發了一個原型，以便在現實農業中利用模型的能力。本文展示了深度學習模型在現實棉花病害管理情境中分類疾病的能力。

A Definition of Good Explanations and the Challenges Explaining LLM Outputs

2606.14838v1 by Louis Mahon, Elliot Ford, Callum Hackett

How to define a good explanation is a long-standing philosophical debate that has recently attracted renewed interest in the context of AI outputs. Explainability is crucial for AI adoption in many contexts, but to produce good explanations of AI systems, we must first understand what good explanations are. In this paper we propose a definition inspired by the notion of counterfactual explanations; however, we argue that one must also take into account the interlocutor's prior beliefs in each fact that could be offered in an explanation. We explore the ramifications of this definition for AI explainability and, in particular, why it is difficult to produce good explanations for LLM outputs.

摘要：如何定義一個好的解釋是一個長期存在的哲學辯論，最近在人工智慧輸出方面重新引起了興趣。可解釋性對於許多情境中的人工智慧採用至關重要，但為了產生良好的人工智慧系統解釋，我們必須首先了解什麼是好的解釋。在本文中，我們提出了一個受反事實解釋概念啟發的定義，然而我們主張還必須考慮對話者在每個可以提供的解釋事實中的先前信念。我們探討了這一定義對人工智慧可解釋性的影響，特別是為什麼大型語言模型的輸出難以產生良好的解釋。

Listening with Attention: Entropy-Guided Explainability for Transformer-Based Audio Models

2606.14647v1 by Ravi Ranjan, Utkarsh Grover, Xiaomin Lin, Agoritsa Polyzou

Transformer-based automatic speech recognition (ASR) models such as Whisper are highly accurate, but their predictions remain difficult to interpret. Existing explainable AI (XAI) methods often lack faithfulness and precise temporal grounding. We propose Listening with Entropy-guided Attention for Faithful explainability (LEAF-X), a model-intrinsic XAI framework for transformer-based ASR. LEAF-X combines entropy-guided attention weighting, multi-layer attention rollout, and optional causal ablations to identify low-entropy, high-impact heads and layers, producing sparse token-to-frame attributions. Unlike perturbation-based explainers or raw attention maps, LEAF-X exploits the internal structure of encoder-decoder and speech-augmented decoder-only models to generate explanations that better reflect model computation. Results show 32% improved faithfulness, 35-39% stronger locality/sparsity, and the most stable attributions, supporting more transparent and auditable ASR.

摘要：基於Transformer的自動語音識別 (ASR) 模型，例如 Whisper，具有很高的準確性，但其預測仍然難以解釋。現有的可解釋人工智慧 (XAI) 方法往往缺乏忠實性和精確的時間定位。我們提出了以熵引導注意力進行忠實解釋的聆聽模型 (LEAF-X)，這是一個針對基於Transformer的 ASR 的模型內在 XAI 框架。LEAF-X 結合了熵引導的注意力加權、多層注意力展開和可選的因果消融，以識別低熵、高影響力的頭部和層，產生稀疏的標記到幀的歸因。與基於擾動的解釋器或原始注意力圖不同，LEAF-X 利用編碼器-解碼器和增強語音的僅解碼器模型的內部結構來生成更好反映模型計算的解釋。結果顯示，忠實性提高了 32%，局部性/稀疏性增強了 35-39%，並且歸因最為穩定，支持更透明和可審計的 ASR。
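
Two of the ingredients named above, entropy-guided head weighting and multi-layer attention rollout, can be sketched on square self-attention matrices. The softmax-of-negative-entropy weighting, the temperature, and the function names are assumptions made for illustration. LEAF-X's actual weighting, and its handling of token-to-frame cross-attention, are not specified in the abstract.

```python
import numpy as np


def head_entropy(attn):
    """Mean row entropy per head; attn has shape (heads, tokens, tokens)."""
    p = np.clip(attn, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1).mean(axis=-1)


def entropy_weighted_layer(attn, temperature=1.0):
    """Combine heads so that low-entropy (more focused) heads get more weight."""
    w = np.exp(-head_entropy(attn) / temperature)
    w /= w.sum()
    return np.tensordot(w, attn, axes=1)        # shape (tokens, tokens)


def rollout(layers):
    """Attention rollout: multiply residual-augmented layer maps from bottom to top."""
    n = layers[0].shape[-1]
    joint = np.eye(n)
    for attn in layers:
        a = 0.5 * entropy_weighted_layer(attn) + 0.5 * np.eye(n)
        a /= a.sum(axis=-1, keepdims=True)
        joint = a @ joint
    return joint


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layers = []
    for _ in range(3):                          # three layers, four heads, six tokens
        e = np.exp(rng.normal(size=(4, 6, 6)))
        layers.append(e / e.sum(axis=-1, keepdims=True))
    print(rollout(layers).round(3))
```

Each row of the result is a distribution over input positions, so it can be read as how much each output position draws on each input after all layers are composed.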

Fodor and Pylyshyn's Systematicity Challenge Still Stands

2606.14512v1 by Michael Goodale, Salvador Mascarenhas

The recent successes of neural networks producing human-like language have caused a significant stir in cognitive science, with many researchers arguing that classical puzzles about human cognition and challenges to artificial intelligence are being solved by neural networks. A notable case is the argument from systematicity due to Jerry Fodor and Zenon Pylyshyn, which holds that humans display systematic biconditional dependencies. For example, someone can understand the sentence "John saw Mary" just in case they understand the sentence "Mary saw John." Symbolic systems explain this systematicity of language and thought, while neural networks offer no immediate explanation. Several recent articles argue that this challenge has now been met by neural networks. In particular, Brenden Lake and Marco Baroni argue that their meta-learning for compositionality protocol matches and perhaps explains human systematicity. We demonstrate that these conclusions are premature. Among other results, we found that their model struggles to learn rules that are even slightly out of distribution compared to their training data. Furthermore, the model behaves unsystematically even on many within-distribution problems. We conclude that Fodor and Pylyshyn's challenge to neural networks remains unmet.

摘要：最近神經網絡在生成類似人類語言方面的成功引起了認知科學界的重大關注，許多研究者認為，關於人類認知的經典難題和對人工智慧的挑戰正被神經網絡解決。一個值得注意的案例是由Jerry Fodor和Zenon Pylyshyn提出的系統性論點，該論點認為人類表現出系統性的雙條件依賴性。例如，某人能理解句子「約翰看見瑪麗」，當且僅當他們理解句子「瑪麗看見約翰」。符號系統解釋了語言和思維的這種系統性，而神經網絡則沒有提供直接的解釋。幾篇最近的文章主張，這一挑戰現在已經被神經網絡所克服。特別是，Brenden Lake和Marco Baroni主張，他們的組合性元學習協議與人類的系統性相匹配，甚至可能解釋了人類的系統性。我們證明這些結論是為時過早的。在其他結果中，我們發現他們的模型在學習與訓練數據相比稍微偏離分佈的規則時遇到了困難。此外，該模型在許多分佈內的問題上表現得不系統。我們得出結論，Fodor和Pylyshyn對神經網絡的挑戰仍未得到滿足。

Learning Urban Access Costs from Origin-Destination Flows via Inverse Optimal Transport

2606.14157v1 by Paula Joy B. Martinez

Cities deliver basic services through mixed public-private facility networks, including schools, clinics, transit providers, and subsidized service points. In these systems, planners often observe where households go, but not the latent cost function through which they trade off factors such as distance, price, and institutional access. We study this urban problem through school choice in the Philippines, where the country's largest national education subsidy is intended to redirect learners from congested public schools to participating private schools. Treating school-to-school enrollment flows as an entropic optimal transport plan, we recover latent choice costs using two complementary inverse optimal transport models: an interpretable distance-banded model with a subsidy term, and a neural cost model trained through a differentiable Sinkhorn forward pass. Applied to 283,016 learner trips across 23,820 observed flows in the most populated region, the framework estimates a subsidy-equivalent distance, $λ^{(k)}$, interpreted as the kilometers of perceived travel cost offset by the subsidy. The case demonstrates how administrative origin-destination data can be transformed into interpretable planning metrics for accessibility-aware subsidy design, facility siting, and urban service allocation.

摘要：城市透過混合的公私設施網絡提供基本服務，包括學校、診所、交通提供者和補貼服務點。在這些系統中，規劃者通常觀察家庭的去向，但並不瞭解它們在距離、價格和機構可及性等因素之間權衡的潛在成本函數。我們通過菲律賓的學校選擇研究這一城市問題，該國最大的國家教育補貼旨在將學習者從擁擠的公立學校引導到參與的私立學校。將學校之間的入學流視為一種熵最優運輸計劃，我們使用兩個互補的逆最優運輸模型恢復潛在的選擇成本：一個可解釋的距離帶模型，帶有補貼項，以及一個通過可微分的Sinkhorn前向傳遞訓練的神經成本模型。應用於最人口稠密地區的283,016次學習者出行，涵蓋23,820個觀察流，該框架估算出一個補貼等效距離$λ^{(k)}$，該距離被解釋為補貼抵消的感知旅行成本的公里數。這一案例展示了如何將行政來源-目的地數據轉化為可解釋的規劃指標，以便進行考慮可及性的補貼設計、設施選址和城市服務分配。
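
The forward model behind this approach, an entropic optimal transport plan computed by Sinkhorn iterations, can be sketched as below. The linear cost (distance minus a subsidy offset in kilometers on eligible origin-destination pairs), the regularisation strength, and all names are illustrative assumptions. The paper's distance-banded and neural cost models are not reproduced here.

```python
import numpy as np


def access_cost(dist_km, eligible, lam):
    """Hypothetical cost: travel distance minus lam km on subsidy-eligible pairs."""
    return dist_km - lam * eligible


def sinkhorn_plan(cost, a, b, eps=0.5, n_iter=500):
    """Entropic OT plan with origin marginals a and destination marginals b."""
    kernel = np.exp(-cost / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (kernel @ v)          # rescale rows toward the origin marginals
        v = b / (kernel.T @ u)        # rescale columns toward the destination marginals
    return u[:, None] * kernel * v[None, :]


if __name__ == "__main__":
    dist = np.array([[1.0, 3.0], [3.0, 1.0]])       # km, two origins x two schools
    eligible = np.array([[0.0, 1.0], [0.0, 0.0]])   # subsidy applies to one pair only
    a = b = np.array([0.5, 0.5])
    for lam in (0.0, 1.5):
        plan = sinkhorn_plan(access_cost(dist, eligible, lam), a, b)
        print(lam, plan.round(3))
```

The subsidy term is made pair-specific on purpose. With fixed marginals, a cost shift that depends on the destination alone is absorbed by the column scaling and leaves the plan unchanged. Inverse optimal transport then adjusts the cost parameters until the predicted plan matches the observed flows.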

Recovering Stranded Discrimination in Knowledge Tracing: Per-Item Bias Correction via Empirical-Bayes Shrinkage

2606.14123v1 by Xiaoran Yan, Cheng Tang, Atsushi Shimada

Deployed knowledge-tracing models are typically frozen after training, yet systematic per-item logit bias arises both from limited per-item expressivity in backbone architectures and from post-deployment shifts in item properties, and this bias degrades prediction quality. Global post-hoc calibrators such as Platt scaling, temperature scaling, and isotonic regression improve probability estimates but leave discriminative ability, as measured by AUC, unchanged. This AUC invariance is a structural consequence of monotone score-only transforms; recovering the stranded discrimination requires conditioning on item identity. We propose SLC (State-space Logit Correction), which converts binary observations to Gaussian pseudo-observations via Laplace/IRLS, applies empirical-Bayes shrinkage through a Kalman smoother, and fits an offset-Platt link. The state-space formulation also yields a detectability bound that characterizes the Bernoulli information floor, explaining why temporal tracking provides no benefit at current data densities. Across four datasets, five backbones, and three seeds, SLC improves AUC on all four datasets and NLL on three, with the advantage concentrating on sparse items. Cross-domain controls suggest that the same phenomenon can arise beyond education when the deployed backbone leaves entity-level bias.

摘要：已部署的知識追蹤模型在訓練後通常會被凍結，但由於主幹架構中每個項目的表達能力有限，以及部署後項目屬性的變化，導致系統性的每項邏輯偏差，從而降低預測質量。全球事後校準器如 Platt 標定、溫度標定和等距回歸改善了概率估計，但對於 AUC 測量的區分能力則沒有改變。這種 AUC 不變性是單調分數轉換的結構性結果；恢復被孤立的區分能力需要對項目身份進行條件化。我們提出了 SLC（狀態空間邏輯校正），通過 Laplace/IRLS 將二元觀察轉換為高斯偽觀察，通過卡爾曼平滑器應用經驗貝葉斯收縮，並擬合偏移 Platt 連結。狀態空間的公式還產生了一個可檢測性界限，描述了伯努利信息底線，解釋了為什麼在當前數據密度下，時間追蹤沒有提供任何好處。在四個數據集、五個主幹和三個隨機種子中，SLC 在所有四個數據集上改善了 AUC，在三個數據集上改善了 NLL，且優勢集中在稀疏項目上。跨領域的控制表明，當部署的主幹留下實體級別的偏差時，這種現象也可能在教育之外出現。
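
The AUC-invariance argument in the abstract above can be checked on a toy example. The sketch below is illustrative only: the data, the item names, and the per-item offset are hypothetical values, and the offset is hand-picked rather than estimated by the shrinkage procedure the abstract describes.

```python
def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: item "a" is systematically under-scored by the backbone.
items = ["a", "a", "b", "b"]
logits = [-1.0, -1.5, 0.5, 0.0]
labels = [1, 0, 1, 0]

base_auc = auc(logits, labels)

# Global temperature scaling is a strictly increasing map, so the ranking is unchanged.
temperature_auc = auc([z / 2.0 for z in logits], labels)

# A per-item offset (assumed value) can reorder scores across items and so change AUC.
item_offset = {"a": 1.5, "b": 0.0}
corrected_auc = auc([z + item_offset[i] for z, i in zip(logits, items)], labels)

print(base_auc, temperature_auc, corrected_auc)  # 0.75 0.75 1.0
```

Any strictly increasing transform of the score alone preserves every pairwise comparison, which is why only a correction that depends on item identity can move AUC.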

How do Self-Supervised Remote Sensing Vision Models Transfer to Downstream Tasks?

2606.13896v1 by Julia Romero, Qin Lv, Morteza Karimzadeh

Self-supervised geospatial foundation models (GeoFMs) learn transferable representations from remote sensing data, but their downstream behavior is difficult to characterize. We study six representative GeoFMs spanning joint-embedding, reconstruction, and multimodal pretraining families, and evaluate transfer across classification, regression, and segmentation benchmarks under different label availability and downstream pipelines. We find that model rankings change across tasks and adaptation settings. Layerwise probing shows that, in most cases, task-relevant information is more accessible in intermediate transformer blocks compared to final-layer embeddings, and that GeoFMs exhibit distinct depthwise profiles. In segmentation case studies on PASTIS and Sen1Floods11, downstream adaptation settings such as decoder design and fine-tuning can be as impactful as the choice of GeoFM, and standard dense-prediction heads may be poorly aligned with how GeoFMs organize information over depth. Finally, CKA analysis on case studies shows that fine-tuning does not rewrite GeoFMs uniformly across depth, and the strongest changes are localized to the first linear layer of the MLP in ViT blocks. These results help explain why GeoFM rankings shift across benchmarks and motivate more representation-aware evaluation and adaptation strategies.

摘要：自我監督的地理空間基礎模型（GeoFMs）從遙感數據中學習可轉移的表示，但其下游行為難以特徵化。我們研究了六個代表性的GeoFMs，涵蓋了聯合嵌入、重建和多模態預訓練類別，並在不同標籤可用性和下游管道下評估分類、回歸和分割基準的轉移。我們發現模型排名在不同任務和適應設置中有所變化。逐層探測顯示，在大多數情況下，與任務相關的信息在中間的Transformer塊中比在最終層嵌入中更易於獲取，並且GeoFMs表現出明顯的深度特徵。在PASTIS和Sen1Floods11的分割案例研究中，下游適應設置如解碼器設計和微調可能與GeoFM的選擇一樣具有影響力，而標準的密集預測頭可能與GeoFMs在深度上組織信息的方式不太一致。最後，對案例研究的CKA分析顯示，微調並不會在深度上均勻地重寫GeoFMs，最強的變化集中在ViT塊的第一個線性層。這些結果有助於解釋為什麼GeoFM的排名在基準之間會變化，並促使更具表示意識的評估和適應策略。
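
The abstract above mentions CKA analysis without stating the variant. As a minimal sketch, assuming the linear form of centered kernel alignment, the similarity between two activation matrices (rows are samples, columns are features) can be computed as follows; the helper names and the toy matrices are hypothetical.

```python
import math

def center(X):
    """Subtract the column mean from every feature."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]

def cross_sq_norm(A, B):
    """Squared Frobenius norm of A^T B, computed column by column."""
    total = 0.0
    for a_col in zip(*A):
        for b_col in zip(*B):
            dot = sum(a * b for a, b in zip(a_col, b_col))
            total += dot * dot
    return total

def linear_cka(X, Y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered features."""
    Xc, Yc = center(X), center(Y)
    return cross_sq_norm(Xc, Yc) / math.sqrt(cross_sq_norm(Xc, Xc) * cross_sq_norm(Yc, Yc))

# Hypothetical activations for three samples before and after fine-tuning.
before = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
rotated = [[0.0, 1.0], [-1.0, 0.0], [1.0, -1.0]]   # 90-degree rotation of `before`
changed = [[1.0], [1.0], [-2.0]]

print(linear_cka(before, rotated), linear_cka(before, changed))
```

Linear CKA is invariant to rotations and isotropic scaling of the features, so a value below 1 between a layer before and after fine-tuning signals a change in representation geometry rather than a mere change of basis.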

Explaining RhythmFormer: A Systematic XAI Analysis of Periodic Sparse Attention for Remote Photoplethysmography

2606.13839v1 by Louis Chen, Torbjörn E. M. Nordling

Remote photoplethysmography (rPPG) transformers achieve low heart-rate error on benchmarks, yet their decisions remain opaque, a growing concern as rPPG moves toward clinical heart rate estimation. Existing rPPG XAI is dominated by qualitative heatmap inspection without quantitative faithfulness metrics or physiology-grounded validation, leaving a gap between visual plausibility and auditable evidence. We address this gap. First, we adapt four attribution methods (raw attention, rollout, flow, Beyond Intuition) to RhythmFormer's bi-level routing attention with top-$k$ selection. Second, we introduce a skin coverage metric quantifying how much attribution mass falls on skin regions. Third, we adapt the SaCo faithfulness coefficient from its original classification setting to rPPG regression by using the MAE between original and perturbed predicted rPPG waveforms as the perturbation impact. Applying these tools, we quantify a multi-hop leakage effect under sparse top-$k$ routing: attention rollout and flow almost completely restore the connections that individual refined-attention layers explicitly set to zero. Beyond Intuition mitigates this via its value-projection-weighted rollout and gradient-supported mask, attaining the highest median refined skin coverage ($0.83$ vs. $0.57$ for vanilla rollout) and faithfulness ($F=0.92$) among the evaluated methods on UBFC-rPPG. Validation across diverse datasets and model variants is needed. A case study on a low-SaCo outlier further shows all four methods recovering consistently once an artefactual region is replaced, suggesting consistent SaCo behavior across attribution families in this illustrative case. Together, these metrics move XAI for rPPG toward auditable numerical evidence about spatial alignment and perturbation faithfulness, i.e. trustworthy rPPG XAI.

摘要：遠程光電容積描記法（rPPG）Transformer在基準測試中實現了低心率誤差，但它們的決策仍然不透明——隨著rPPG向臨床心率估計的發展，這成為一個日益關注的問題。現有的rPPG可解釋人工智慧（XAI）主要依賴於定性的熱圖檢查，缺乏定量的真實性指標或基於生理學的驗證，這使得視覺上的合理性與可審計的證據之間存在差距。我們針對這一差距進行了研究。首先，我們將四種歸因方法（原始注意力、展開、流動、超越直覺）適應於RhythmFormer的雙層路由注意力，並進行top-$k$選擇。其次，我們引入了一種皮膚覆蓋度指標，量化有多少歸因質量落在皮膚區域上。第三，我們將SaCo真實性係數從其原始分類設置調整為rPPG回歸，通過使用原始和擾動預測的rPPG波形之間的平均絕對誤差（MAE）作為擾動影響。應用這些工具，我們量化了在稀疏top-$k$路由下的多跳洩漏效應：注意力展開和流動幾乎完全恢復了個別精煉注意力層明確設置為零的連接。超越直覺通過其值投影加權的展開和梯度支持的掩碼來減輕這一問題，在評估的UBFC-rPPG方法中達到了最高的中位數精煉皮膚覆蓋度（$0.83$對比$0.57$的普通展開）和真實性（$F=0.92$）。需要在多樣化數據集和模型變體上進行驗證。一項關於低SaCo異常值的案例研究進一步顯示，一旦替換了人為產生的區域，所有四種方法都能一致恢復，這表明在這一示例案例中，歸因家族之間的SaCo行為是一致的。總體而言，這些指標使rPPG的XAI朝著可審計的數字證據邁進，關於空間對齊和擾動真實性，即值得信賴的rPPG XAI。
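
The multi-hop leakage effect described in the abstract above can be reproduced on a toy example. The sketch assumes the standard attention-rollout recipe (average each attention map with the identity, then multiply across layers) and a simple mass-ratio definition of skin coverage; the three-token attention maps, the mask, and the function names are hypothetical and not taken from the paper.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def add_residual(A):
    """Mix in the identity (residual path) and renormalize each row."""
    n = len(A)
    mixed = [[0.5 * A[i][j] + (0.5 if i == j else 0.0) for j in range(n)] for i in range(n)]
    return [[x / sum(row) for x in row] for row in mixed]

def rollout(layers):
    """Multiply residual-augmented attention maps from the first layer to the last."""
    result = add_residual(layers[0])
    for A in layers[1:]:
        result = matmul(add_residual(A), result)
    return result

def skin_coverage(attribution, skin_mask):
    """Fraction of total attribution mass that falls on positions marked as skin."""
    total = sum(attribution)
    if total <= 0:
        return 0.0
    return sum(a for a, is_skin in zip(attribution, skin_mask) if is_skin) / total

# Sparse routing: token 0 never attends to token 2 in either layer.
sparse = [[0.5, 0.5, 0.0],
          [0.0, 0.5, 0.5],
          [0.0, 0.0, 1.0]]
rolled = rollout([sparse, sparse])

print(sparse[0][2], rolled[0][2])  # 0.0 0.0625
print(skin_coverage(rolled[0], [True, False, True]))
```

Token 0 reaches token 2 through token 1 once the layers are multiplied, so the rolled-out map assigns nonzero weight to a connection that every individual layer set to zero.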

ArogyaSutra: A Multi-Agent Framework for Multimodal Medical Reasoning in Indic Languages

2606.13572v1 by Tanmoy Kanti Halder, Akash Ghosh, Subhadip Baidya, Arijit Roy, Sriparna Saha

Multimodal Large Language Models (MLLMs) have shown promising reasoning capabilities in general domains, yet their performance remains limited in specialized settings such as healthcare, especially in multilingual and low-resource scenarios. This gap is critical in regions like rural India, where patients often express complex medical queries in native Indic languages and rely on multimodal inputs such as medical images. Existing English-centric MLLMs struggle to support such use cases, limiting equitable access to AI-driven healthcare assistance. To address this challenge, we introduce ArogyaBodha, a large-scale multilingual multimodal medical question-answer dataset constructed from eight heterogeneous sources, covering 31 body systems, six imaging modalities, and 21 clinical domains across English and seven major Indian languages. We further propose ArogyaSutra, an actor-critic-based multi-agent framework that integrates tool grounding with dual-memory mechanisms for step-wise, reasoning-aware decision making, and uses stored actor-critic simulation trajectories for distillation. Experiments show that our dataset and framework improve multilingual medical reasoning accuracy across all Indic languages, with ablations validating the contribution of each component. The source code and dataset are available at: https://iitp-cse.github.io/ArogyaSutra/

摘要：多模態大型語言模型（MLLMs）在一般領域中顯示出有希望的推理能力，但在專業環境中，如醫療保健，尤其是在多語言和低資源的情境下，其表現仍然有限。這一差距在像印度農村這樣的地區尤為重要，患者經常用母語印地語表達複雜的醫療問題，並依賴醫療影像等多模態輸入。現有以英語為中心的 MLLMs 難以支持這類用例，限制了公平獲得 AI 驅動的醫療協助。為了解決這一挑戰，我們介紹了 ArogyaBodha，這是一個大型多語言多模態醫療問答數據集，該數據集由八個異質來源構建，涵蓋 31 個身體系統、六種影像模態和 21 個臨床領域，涉及英語和七種主要的印度語言。我們進一步提出了 ArogyaSutra，一個基於演員-評論家的多代理框架，將工具基礎與雙重記憶機制整合，用於逐步的、具推理意識的決策，並利用存儲的演員-評論家模擬軌跡進行蒸餾。實驗表明，我們的數據集和框架提高了所有印地語言的多語言醫療推理準確性，消融實驗驗證了每個組件的貢獻。源代碼和數據集可在以下位置獲得：https://iitp-cse.github.io/ ArogyaSutra/

Is It You or Your Environment? A Bayesian Inference Framework for Genomically-Anchored Personalized Physiological Interpretation

2606.13556v2 by Aruna Dey, Suraj Biswas

Personalized health AI systems face a fundamental cold-start problem: machine learning models for physiological interpretation require weeks of individual behavioral data before they can distinguish constitutional variation from environmentally driven deviation. We propose a solution grounded in causal inference and Bayesian prior design. An individual's genomic profile serves as an exogenous genetic anchor -- a domain-informed, personalized prior that is fixed at conception, immune to reverse causation, and available before a single behavioral observation is collected. The anchor initializes a Bayesian belief state over an individual's physiological set point $\hat{G} = \mu + \sum_i \beta_i g_i$, where $\beta_i$ are GWAS-derived effect sizes and $g_i$ are risk-allele counts. Each incoming physiological measurement $P$ produces a non-constitutional deviation $\delta = P - \hat{G}$ that separates the signal attributable to environment and state from the constitutionally fixed baseline. As behavioral data accrue, the prior decays according to $\hat{G}_t = w(t)\,\hat{G}_{\text{genomic}} + [1 - w(t)]\,\bar{P}_t$, transitioning from genome-dominated to empirical-baseline-dominated inference. The same observed HRV of 55 ms generates a suppression hypothesis for a person whose prior predicts 80 ms, and an enhancement hypothesis for a person whose prior predicts 30 ms -- a reversal impossible without a personalized anchor. We develop this architecture across six physiological domains, grading genomic priors by evidence strength, distinguishing robustly replicated anchors (FTO, FADS1/2, FKBP5) from contested candidate genes (SLC6A4, MAOA, DRD2). We address the inference boundary between association, Mendelian randomization, and individual token causation, and define four constraints for deployment: evidence-graded priors, dynamic decay, ancestry-matched effect sizes, and attribution rather than deterministic output.

摘要：個性化健康人工智慧系統面臨一個根本的冷啟動問題：生理解釋的機器學習模型需要數週的個體行為數據，才能區分憲法變異與環境驅動的偏差。我們提出了一個基於因果推斷和貝葉斯先驗設計的解決方案。個體的基因組特徵作為外生的遺傳錨點——一個基於領域知識的個性化先驗，固定於受孕時，不受逆因果影響，並且在收集到任何行為觀察之前就已可用。這個錨點初始化了一個貝葉斯信念狀態，該狀態關於個體的生理設置點 G-hat = mu + sum(beta_i * g_i)，其中 beta_i 是 GWAS 衍生的效應大小，g_i 是風險等位基因數量。每個進來的生理測量 P 產生一個非憲法偏差 delta = P - G-hat，這將環境和狀態所造成的信號與憲法固定的基線分開。隨著行為數據的累積，先驗根據 G-hat_t = w(t)G-hat_genomic + [1-w(t)]P-bar_t 衰減，從基因組主導的推斷過渡到經驗基線主導的推斷。同樣觀察到的 HRV 為 55 毫秒，對於一個其先驗預測 80 毫秒的人，產生了一個抑制假設，而對於一個其先驗預測 30 毫秒的人，則產生了一個增強假設——這種反轉在沒有個性化錨點的情況下是不可能的。我們在六個生理領域內發展這一架構，根據證據強度對基因組先驗進行分級，明確區分穩健重複的錨點（FTO、FADS1/2、FKBP5）與有爭議的候選基因（SLC6A4、MAOA、DRD2）。我們解決了關聯、孟德爾隨機化和個體標記因果之間的推斷邊界，並定義了四個部署約束：證據分級的先驗、動態衰減、祖先匹配的效應大小，以及歸因而非確定性輸出。
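
The set-point, deviation, and prior-decay formulas in the abstract above translate directly into a few lines of code. The sketch below is a worked illustration under stated assumptions: the abstract does not specify the weight schedule w(t), so an exponential decay with a hypothetical time constant `tau_days` is used, and all function names and numeric inputs other than the 55/80/30 ms HRV example are invented for the example.

```python
import math

def genomic_set_point(mu, betas, allele_counts):
    """Set point: population mean plus the sum of effect size times risk-allele count."""
    return mu + sum(b * g for b, g in zip(betas, allele_counts))

def deviation(measurement, set_point):
    """Non-constitutional deviation: measurement minus set point."""
    return measurement - set_point

def prior_weight(t_days, tau_days=30.0):
    """Hypothetical schedule w(t) = exp(-t / tau); the abstract leaves w(t) unspecified."""
    return math.exp(-t_days / tau_days)

def blended_set_point(t_days, g_hat_genomic, p_bar_t, tau_days=30.0):
    """Weighted blend of the genomic prior and the running empirical baseline."""
    w = prior_weight(t_days, tau_days)
    return w * g_hat_genomic + (1.0 - w) * p_bar_t

hrv_ms = 55.0
print(deviation(hrv_ms, 80.0))  # -25.0, read as suppression relative to the prior
print(deviation(hrv_ms, 30.0))  # 25.0, read as enhancement relative to the prior
print(blended_set_point(0.0, 80.0, 60.0), blended_set_point(365.0, 80.0, 60.0))
```

At day 0 the blended set point equals the genomic prior, and after many time constants it approaches the empirical baseline, which is the transition the abstract describes.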

Physics-Guided Spatiotemporal Learning for Coastal Wave Peak Period Estimation from Video

2606.13302v1 by Abubakar Hamisu Kamagata, Dharm Singh Jat, Attlee Munyaradzi Gamundani, Abhishek Srivastava, Paramasivam Saravanakumar

Wave parameters in the nearshore are crucial for coastal engineering, shoreline protection, marine hazard assessment, and coastal management for climate resilience. Traditional monitoring systems such as buoys and radar platforms offer accurate monitoring but can have high installation and maintenance expenses and limited spatial coverage. Passive ocean monitoring using video has been achieved by leveraging deep learning; however, many methods are not physically interpretable, feasible, or validated for oceanography. In this work, a Physics-Guided Deep Spatiotemporal Learning Framework for direct estimation of nearshore wave peak periods from passive coastal video streams is proposed. The framework combines automated temporal-variance-based region-of-interest detection, multi-stage Sim-to-Real transfer learning, and physics-informed regularization to enhance predictive accuracy and physical consistency. A variety of spatiotemporal architectures were assessed, including transformer-based and recurrent-convolutional ones, alongside synthetic pretraining, silver-label adaptation, and expert fine-tuning. The results show that transformer-based architectures achieved the best instantaneous prediction accuracy, while lightweight recurrent-convolutional architectures achieved higher temporal stability and operational oceanographic skill. Ablation studies also demonstrated the benefits of physics-guided regularization for trend-following consistency and for reducing physically implausible predictions. Explainability auditing also showed that attention focused on hydrodynamically active surf-zone regions and agreed well with the physically derived wave propagation behavior. In general, the proposed framework shows the promise of physics-guided video-based deep learning systems for long-term coastal wave monitoring that are cost-efficient and operationally feasible.

摘要：近岸的波浪參數對於海岸工程、海岸線保護、海洋危害評估以及氣候韌性的海岸管理至關重要。傳統的監測系統如浮標和雷達平台提供準確的監測，但可能會有高昂的安裝和維護費用以及有限的空間覆蓋範圍。利用深度學習實現的被動海洋監測已經取得進展，然而，許多方法在物理上不可解釋、不可行，且未經海洋學驗證。在本研究中，提出了一種物理引導的深度時空學習框架，用於從被動海岸視頻流中直接估算近岸波峰周期。該框架結合了基於自動時間變異的區域興趣檢測、多階段的模擬到實際轉移學習以及物理知識引導的正則化，以提高預測準確性和物理一致性。評估了多種時空架構，如基於Transformer的和循環卷積的架構，以及合成預訓練、銀標適應和專家微調。結果顯示，基於Transformer的架構在瞬時預測的準確性方面表現優於其他架構，而輕量級的循環卷積架構則實現了更高的時間穩定性和操作海洋學技能。消融研究還顯示了物理引導正則化在趨勢跟隨一致性和物理上不合理預測方面的好處。可解釋性審計還有助於將注意力集中在水動力活躍的衝浪區域，並顯示出與物理推導的波浪傳播行為良好的一致性。總的來說，所提出的框架顯示了物理引導的基於視頻的深度學習系統在長期海岸波浪監測中的潛力，這些系統具有成本效益和操作可行性。

Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework for Taxonomy, Detection, and Mitigation under Regulatory Constraints

2606.13211v1 by Omar Alshahrani, Muzammil Behzad

AI systems are being deployed across medical imaging faster than their failure modes are understood. At present, the failure of greatest clinical concern is hallucination: clinically plausible but factually incorrect outputs, including fabricated anatomical structures, missed findings, incorrect laterality, and invented measurements in generated reports, with direct consequences for, among others, biopsy decisions, staging, and treatment planning. This structured narrative synthesizes peer-reviewed studies, benchmark datasets, and FDA regulatory guidance across five imaging modalities to produce a cross-modality analysis of hallucination taxonomy, etiology, detection, and mitigation. Specifically, we address three questions in this study: (1) how can existing taxonomies be unified across modalities? (2) do medical-specialized foundation models hallucinate less than general-purpose ones? and (3) which mitigation strategies are effective and compatible with FDA lifecycle oversight? We note that three taxonomic frameworks together cover the imaging pipeline in a way no single framework does alone. We also highlight that general-purpose foundation models outperform medical-specialized models on hallucination-specific benchmarks, indicating that narrow domain fine-tuning can introduce overfitting-induced confabulation. At the same time, the oversight of radiologists remains essential; for instance, a very high percentage of AI-generated flags required expert correction before clinical use. Physics-informed architectural constraints, Chain-of-Thought prompting, and human-in-the-loop safeguards each address different failure modes and are effective when combined. All findings are mapped to the FDA's Total Product Lifecycle and Predetermined Change Control Plan frameworks, which treat hallucination management as a lifecycle obligation rather than a pre-deployment checklist.

摘要：AI 系統在醫學影像領域的部署速度超過了對其失效模式的理解。此時，最大的臨床關注點是幻覺：臨床上看似合理但事實上不正確的輸出，包括虛構的解剖結構、漏診、錯誤的側別以及生成報告中的虛構測量，這些都會直接影響，例如，活檢決策、分期和治療計劃。這篇結構化敘述綜合了同行評審的研究、基準數據集和 FDA 的監管指導，涵蓋五種影像模式，以產生對幻覺分類、病因、檢測和緩解的跨模式分析。具體而言，我們在這項研究中解決三個問題：(1) 如何統一現有的分類法？(2) 醫學專用的基礎模型如何比通用模型產生更少的幻覺？以及 (3) 哪些緩解策略是有效的並且與 FDA 生命週期監管相容？我們注意到，三個分類框架共同覆蓋了影像流程，而單一框架無法單獨做到這一點。我們還強調，通用基礎模型在針對幻覺的基準測試中表現優於醫學專用模型，這表明狹窄領域的微調可能會導致過擬合引起的虛構。同時，放射科醫生的監督仍然至關重要；例如，極高比例的 AI 生成標記在臨床使用前需要專家修正。基於物理的架構約束、思考鏈提示以及人機協作的安全措施各自針對不同的失效模式，並且在結合使用時效果良好。所有發現都映射到 FDA 的總產品生命週期和預定變更控制計劃框架，將幻覺管理視為生命週期的責任，而非部署前的檢查清單。

Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation

2606.13135v1 by Elena S. Kozachok, Sergey S. Seregin, Aleksandr V. Kozachok, Ilya P. Latyshev, Oleg I. Samovarov

Purpose. To compare deep learning architectures and classification schemes for dermoscopic images of skin neoplasms and assess their generalization on transfer from open international datasets to independent clinical datasets of Russian practice. Methods. Four architectures (ViT-B/16, Swin-S, ConvNeXt-S, EfficientNetV2-S) were compared in three schemes: binary (malignant/benign), single-stage four-class (benign, MEL, SCC, BCC), and a two-stage cascade (binary triage, then three-class differentiation MEL/SCC/BCC). All models used ImageNet-pretrained weights and a single augmentation protocol on aggregated open ISIC Archive data, and were evaluated on an internal held-out sample and two clinical datasets (Melanoscope AI mobile system; Sechenov University). Results. Internally the binary stage attains ROC-AUC 0.952-0.966; on Sechenov University it drops to 0.797-0.893, sensitivity to 0.53-0.67, and ECE rises from 0.02 to 0.27-0.39 with underestimation of malignancy, quantifying a generalization gap in ranking and calibration. Paired tests confirm one inter-architecture result on clinical data: the deficit of ViT-B/16 at the binary stage (p<0.05); at the differentiation stage no architecture has a proven advantage. The cascade raises macro F1 over single-stage four-class classification for most architectures, but significantly only for ViT-B/16, by recovering malignant lesions assigned to the dominant benign class. On ISIC MILK10k, direct 11-class classification yields mean-class sensitivity 0.525. Conclusion. A tunable triage threshold gives sensitivity control not attainable in standard single-stage (argmax) classification and better reproduces clinical differential-diagnosis logic. The persistent generalization gap mandates external clinical validation and recalibration before deployment.

摘要：目的。比較深度學習架構和皮膚腫瘤的皮膚鏡圖像分類方案，並評估其從開放國際數據集轉移到俄羅斯臨床實踐獨立數據集的泛化能力。方法。比較了四種架構（ViT-B/16、Swin-S、ConvNeXt-S、EfficientNetV2-S）在三種方案中的表現：二元（惡性/良性）、單階段四類（良性、MEL、SCC、BCC），以及兩階段級聯（二元篩選，然後三類區分MEL/SCC/BCC）。所有模型均使用ImageNet預訓練權重和單一增強協議，基於聚合的開放ISIC檔案數據進行訓練，並在內部保留樣本和兩個臨床數據集（Melanoscope AI移動系統；塞琴諾夫大學）上進行評估。結果。在內部，二元階段達到ROC-AUC 0.952-0.966；在塞琴諾夫大學下降至0.797-0.893，靈敏度降至0.53-0.67，ECE從0.02上升至0.27-0.39，並低估了惡性腫瘤，量化了排名和校準中的泛化差距。配對測試確認了一個臨床數據上的架構間結果：ViT-B/16在二元階段的不足（p<0.05）；在區分階段，沒有架構具有明顯優勢。級聯方法在大多數架構中提高了宏觀F1分數，超過單階段四類分類，但對於ViT-B/16的提升顯著，因為它恢復了被分配到主導良性類別的惡性病變。在ISIC MILK10k上，直接的11類分類產生的平均類別靈敏度為0.525。結論。可調的篩選閾值提供了在標準單階段（argmax）分類中無法實現的靈敏度控制，並更好地重現臨床鑑別診斷邏輯。持續的泛化差距要求在部署之前進行外部臨床驗證和重新校準。
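
The contrast the abstract above draws between single-stage argmax classification and a two-stage cascade with a tunable triage threshold can be illustrated in a few lines. This is a hedged sketch and not the authors' pipeline: the function names, the probabilities, and the threshold values are hypothetical.

```python
def single_stage_predict(class_probs):
    """Standard argmax over all four classes."""
    return max(class_probs, key=class_probs.get)

def cascade_predict(p_malignant, subtype_probs, triage_threshold=0.5):
    """Stage 1: binary triage on p_malignant. Stage 2: pick among malignant subtypes."""
    if p_malignant < triage_threshold:
        return "benign"
    return max(subtype_probs, key=subtype_probs.get)

# Hypothetical lesion: the benign class dominates, but malignancy is far from excluded.
four_class = {"benign": 0.55, "MEL": 0.30, "SCC": 0.05, "BCC": 0.10}
p_malignant = 0.45
subtypes = {"MEL": 0.67, "SCC": 0.11, "BCC": 0.22}

print(single_stage_predict(four_class))                          # benign
print(cascade_predict(p_malignant, subtypes))                    # benign at threshold 0.5
print(cascade_predict(p_malignant, subtypes, triage_threshold=0.3))  # MEL at threshold 0.3
```

Lowering the triage threshold trades specificity for sensitivity at the first stage, a control that argmax over four classes does not expose.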

Zero-source LLM Hallucination Detection with Human-like Criteria Probing

2606.12900v1 by Jiahao Yang, Shuhai Zhang, Hailong Kang, Feng Liu, Qi Chen, Mingkui Tan

Large language models (LLMs) often hallucinate by generating factually incorrect or unfaithful content, posing significant risks to their safe use. Detecting such hallucinations is particularly challenging under the zero-source constraint, where no model internals or external references are available, and detection must rely solely on the textual query-answer pair. In this paper, we propose Human-like Criteria Probing for Hallucination Detection (HCPD), a paradigm that emulates the multi-faceted reasoning of human evaluators. Its core is a Human-like Criteria Probing (HCP) mechanism, in which an LLM agent adaptively decomposes its judgment into a weighted set of interpretable criteria and aggregates criterion-specific scores into a final truthfulness measure. To achieve this adaptive capability, we introduce a reward-based alignment scheme using only weak supervision from semantic consistency. At inference, we employ a multi-sampling aggregation strategy to ensure robust decisions while preserving full interpretability. We further provide theoretical analysis supporting the reliability of our approach. Extensive experiments show that HCPD consistently outperforms state-of-the-art baselines, offering an effective and explainable solution for zero-source hallucination detection. Code is available at https://github.com/TRISKEL10N/HCPD.

摘要：大型語言模型（LLMs）經常會產生事實上不正確或不忠實的內容，這對其安全使用構成了重大風險。在零來源約束下檢測這種幻覺尤其具有挑戰性，因為沒有模型內部或外部參考可用，檢測必須僅依賴文本查詢-回答對。在本文中，我們提出了人類類似標準探測幻覺檢測（HCPD），這是一種模擬人類評估者多面向推理的範式。其核心是一個人類類似標準探測（HCP）機制，在此機制中，LLM代理根據加權的可解釋標準自適應地分解其判斷，並將標準特定的分數匯總為最終的真實性度量。為了實現這種自適應能力，我們引入了一種基於獎勵的對齊方案，僅使用來自語義一致性的弱監督。在推理過程中，我們採用多重採樣聚合策略，以確保穩健的決策，同時保持完全的可解釋性。我們進一步提供理論分析以支持我們方法的可靠性。大量實驗表明，HCPD始終優於最先進的基準，提供了一種有效且可解釋的零來源幻覺檢測解決方案。代碼可在 https://github.com/TRISKEL10N/HCPD 獲得。

PolicyGuard: Towards Test-time and Step-level Adversary Defense for Reinforcement Learning Agent

2606.12896v1 by Junfeng Guo, Heng Huang

While real-world applications of reinforcement learning (RL) are becoming increasingly popular, the security of RL systems deserves more attention and exploration. In particular, recent work has revealed that RL agents are vulnerable to backdoor attacks, where a victim agent behaves normally under standard conditions but executes malicious actions when a specific trigger is activated. Existing backdoor defenses for RL either require access to the agent's internal parameters, operate only at the model or trajectory level, or are limited to specific attack types. To ensure the security of RL agents, we propose PolicyGuard, a test-time, step-level backdoor defense that leverages Gaussian Process (GP) posterior variance and adapts pseudo trajectories to enable uncertainty computation for individual time steps. We also provide theoretical foundations to explain the efficacy of GP posterior variance. Extensive experiments across seven RL games demonstrate that PolicyGuard achieves state-of-the-art detection performance in most cases, with average AUROC of 0.856 for perturbation-based attacks and 0.859 for adversary-agent attacks.

摘要：雖然強化學習（RL）的實際應用越來越受歡迎，但RL系統的安全性卻值得更多的關注和探索。特別是，最近的研究顯示，RL代理容易受到後門攻擊的影響，受害者代理在標準條件下表現正常，但在特定觸發器被激活時會執行惡意行為。現有的RL後門防禦要麼需要訪問代理的內部參數，要麼僅在模型或軌跡層面運作，或僅限於特定的攻擊類型。為了確保RL代理的安全性，我們提出了\texttt{PolicyGuard}，這是一種\textit{測試時步驟級別}的後門防禦，利用高斯過程（GP）後驗方差並調整偽軌跡，以便對個別時間步進行不確定性計算。此外，我們還提供了理論基礎來解釋GP後驗方差的有效性。在七個RL遊戲中的大量實驗表明，PolicyGuard在大多數情況下達到了最先進的檢測性能，對於基於擾動的攻擊平均AUROC為0.856，對於對抗代理攻擊則為0.859。

Acquisition state behaves as a structured, measurable variable governing lung-nodule AI: kernel-driven measurement instability and noise-driven detection fragility, invisible to DICOM metadata

2606.12824v1 by Daniel Soliman

AI governance for medical imaging is formalizing: the 2026 ACR-SIIM Practice Parameter recommends local acceptance testing and ongoing drift monitoring, and the ACR Assess-AI registry monitors AI outputs using DICOM metadata for context. We argue that a necessary, currently unmonitored layer sits beneath output metrics: whether incoming studies remain within the acquisition envelope a model was validated on. Using a LUNA16-trained MONAI RetinaNet lung-nodule detector, we test whether acquisition state behaves as a structured, measurable variable. On real paired CT differing only in reconstruction kernel (NLST B30f vs B80f), kernel alone shifted AI-measured diameter and flipped a Fleischner size category in 5.2% (8 of 155) of nodules at fixed patient and acquisition, while detection confidence was unchanged (Wilcoxon p=0.22). Under controlled LIDC-IDRI perturbations the effects dissociated by axis: the noise axis degraded detection confidence (p=5.9e-32, concentrated in nodules under 6 mm) but not measurement, while the frequency/kernel axis corrupted measurement (p=8.6e-13) but not detection. A 4-feature pixel fingerprint recovered reconstruction identity (patient-level AUC about 0.95 on real CT, 0.995 on a QIBA phantom) where the ConvolutionKernel DICOM tag was uninformative (identical labels across reconstructions). The kernel axis transported across four manufacturers (leave-one-vendor-out AUC 0.94-0.98, matching the within-vendor ceiling). Acquisition state thus maps to distinct AI failure modes, frequency content to measurement reliability and noise to detection sensitivity, and is not recoverable from metadata. Acquisition-aware, input-side validation is the missing layer for the acceptance-testing and drift-monitoring requirements now entering imaging-AI accreditation.

摘要：AI 醫療影像治理正在正式化：2026 年 ACR-SIIM 實踐參數建議進行本地接受測試和持續的漂移監控，而 ACR Assess-AI 註冊處則利用 DICOM 元數據監控 AI 輸出以提供上下文。我們主張，在輸出指標之下，存在一個必要的、目前未被監控的層面：即進來的研究是否仍然在模型驗證時的獲取範圍內。使用 LUNA16 訓練的 MONAI RetinaNet 肺結節檢測器，我們測試獲取狀態是否作為一個結構化的、可測量的變量。在僅在重建核上有所不同的真實配對 CT（NLST B30f 與 B80f）上，僅核便改變了 AI 測量的直徑，並在固定的患者和獲取下將 5.2%（8/155）的結節的 Fleischner 大小類別翻轉，而檢測信心則保持不變（Wilcoxon p=0.22）。在受控的 LIDC-IDRI 擾動下，影響按軸分離：噪聲軸降低了檢測信心（p=5.9e-32，集中在小於 6 mm 的結節上），但未影響測量，而頻率/核軸則損壞了測量（p=8.6e-13），但未影響檢測。一個 4 特徵像素指紋恢復了重建身份（在真實 CT 上的患者級 AUC 約為 0.95，在 QIBA 幻影上為 0.995），而 ConvolutionKernel DICOM 標籤則無法提供有用信息（在重建中標籤相同）。因此，核軸在四個製造商之間傳輸（去除一個供應商的 AUC 為 0.94-0.98，與供應商內的上限相匹配）。獲取狀態因此映射到不同的 AI 失效模式，頻率內容映射到測量可靠性，噪聲映射到檢測敏感性，且無法從元數據中恢復。具備獲取意識的輸入端驗證是當前進入影像 AI 認證的接受測試和漂移監控要求中缺失的層面。

LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data

2606.12699v1 by Yifan Gao, Yanmin Gong, Yun Shi, Yuanxiong Guo

Type 2 Diabetes (T2D) poses an increasing global health threat, demanding effective glycemic assessment to support personalized and improved diabetes care. Wearable sensors such as continuous glucose monitors (CGM) and fitness trackers offer many valuable insights for glycemic assessment. However, effectively analyzing these data requires integration with essential individual-level context. Existing methods are often based on traditional machine learning (ML), rely primarily on historical blood glucose measurements, and overlook personalized information, which limits their performance across diverse diabetes populations. Recent advances in large language models (LLMs) have demonstrated their ability to integrate diverse data modalities while modeling sequential dependencies, motivating the exploration of their potential for personalized glycemic assessment. In this paper, we propose GlyLLM, an LLM-powered framework for modeling CGM-based glycemic dynamics through the integration of wearable sensor data and structured metadata. GlyLLM can leverage the extensive prior knowledge of pre-trained LLMs and achieve sensor-text semantic abstraction at decision time. Experiments on two related tasks on the AI-READI dataset demonstrate that our model outperforms traditional ML methods by an average of 13.66% in Root Mean Squared Error (RMSE) for glucose forecasting and 13.08% in Area Under the Receiver Operating Characteristic curve (AUROC) for diabetes categorization. Additionally, our ablation study shows that diabetes surveys and biometric tests are more critical than other health information for glycemic assessment. Our work presents a promising step toward harnessing the power of LLMs to advance personalized glycemic assessment in T2D care.

摘要：2型糖尿病（T2D）對全球健康構成日益嚴重的威脅，迫切需要有效的血糖評估以支持個性化和改善的糖尿病護理。可穿戴傳感器，如持續血糖監測儀（CGM）和健身追蹤器，為血糖評估提供了許多有價值的見解。然而，有效分析這些數據需要與重要的個體層面背景整合。現有的方法通常基於傳統的機器學習（ML），主要依賴歷史血糖測量，並忽略個性化信息，這限制了它們在多樣化糖尿病人群中的表現。最近在大型語言模型（LLMs）方面的進展已顯示出它們能夠整合多樣的數據模態，同時建模序列依賴性，這激勵我們探索它們在個性化血糖評估中的潛力。在本文中，我們提出了GlyLLM，一個基於LLM的框架，用於通過整合可穿戴傳感器數據和結構化元數據來建模基於CGM的血糖動態。GlyLLM可以利用預訓練LLMs的廣泛先驗知識，並在決策時實現傳感器-文本語義抽象。在AI-READI數據集上的兩個相關任務實驗表明，我們的模型在血糖預測中比傳統的ML方法平均提高了13.66\%的均方根誤差（RMSE），在糖尿病分類中提高了13.08\%的接收者操作特徵曲線下的面積（AUROC）。此外，我們的消融研究顯示，糖尿病調查和生物識別測試對於血糖評估比其他健康信息更為關鍵。我們的工作為利用LLMs的力量推進T2D護理中的個性化血糖評估邁出了有希望的一步。

Atlas H&E-TME: Scalable AI-Based Tissue Profiling at Expert Pathologist-Level Accuracy

2606.12346v1 by Kai Standvoss, Miriam Hägele, Rosemarie Krupar, Julika Ribbat-Idel, Jennifer Altschüler, Gerrit Erdmann, Hans Pinckaers, Evelyn Ramberger, Madleen Drinkwitz, Ádám Nárai, Alexander Möllers, Katja Lingelbach, Sebastian Kons, Lukas Hönig, Recepcan Adigüzel, Joana Baião, Alberto Megina Gonzalo, Marius Teodorescu, Marie-Lisa Eich, Paolo Chetta, Shakil Merchant, Verena Aumiller, Simon Schallenberg, Andrew Norgan, Klaus-Robert Müller, Lukas Ruff, Maximilian Alber, Frederick Klauschen

Hematoxylin and eosin (H&E) staining is the cornerstone of histopathology, yet scalable, quantitative analysis of H&E whole-slide images (WSIs) remains a central challenge in computational pathology. We present Atlas H&E-TME, an AI-based system built on the Atlas family of pathology foundation models that predicts tissue quality, tissue region, and cell type labels across multiple cancer types, yielding over 4,500 quantitative readouts per slide at cell-level resolution. A key challenge to validating such systems is overcoming morphological ambiguity inherent to H&E-only ground truth and the limited scalability of more informed references drawing on modalities such as immunohistochemistry (IHC). We address this with a dual validation framework combining biologically grounded depth with technical and morphological breadth. For depth, we propose an IHC-informed multi-pathologist consensus protocol that substantially improves inter-rater agreement over conventional H&E-only annotation. This yields a molecularly grounded reference against which we compare Atlas H&E-TME and pathologists working from H&E alone. For breadth, we benchmark Atlas H&E-TME on over 200,000 high-confidence H&E-only pathologist annotations across 1,500+ cases spanning eight cancer types and their most common metastatic sites, with subtypes covering >90% of clinical cases per cancer type, drawn from 25+ sources and 8+ scanner models. Benchmarked against the IHC-informed consensus, Atlas H&E-TME matches or exceeds pathologist H&E-only performance and generalizes consistently and robustly across this broad morphological and technical scope. In doing so, Atlas H&E-TME turns the H&E slide -- the most ubiquitous data in pathology -- into a scalable, quantitative window into the tumor and its microenvironment, laying a foundation for the next generation of tissue-based biomarkers in translational and clinical research.

摘要：Hematoxylin 和 eosin (H&E) 染色是組織病理學的基石，但 H&E 全片影像 (WSIs) 的可擴展、定量分析仍然是計算病理學中的一個主要挑戰。我們提出了 Atlas H&E-TME，一個基於 AI 的系統，建立在 Atlas 病理學基礎模型家族上，能夠預測多種癌症類型的組織質量、組織區域和細胞類型標籤，每張幻燈片提供超過 4,500 個細胞級別解析的定量讀數。驗證此類系統的一個主要挑戰是克服 H&E 僅有的真實標準中固有的形態學模糊性，以及基於免疫組織化學 (IHC) 等模式的更具信息性的參考的有限可擴展性。我們通過一個雙重驗證框架來解決這個問題，結合了生物學上扎實的深度與技術和形態學的廣度。在深度方面，我們提出了一個 IHC 資訊的多病理學家共識協議，顯著提高了與傳統 H&E 僅有標註相比的評估者間一致性。這提供了一個分子基礎的參考，與我們比較 Atlas H&E-TME 和僅使用 H&E 的病理學家。在廣度方面，我們在超過 200,000 個高信心的 H&E 僅有病理學家標註上對 Atlas H&E-TME 進行基準測試，這些標註來自 1,500 多個案例，涵蓋八種癌症類型及其最常見的轉移部位，亞型覆蓋每種癌症類型超過 90% 的臨床案例，來自 25 多個來源和 8 種以上的掃描儀模型。與 IHC 資訊的共識進行基準測試後，Atlas H&E-TME 的表現與病理學家的 H&E 僅有表現相匹配或超過，並在這個廣泛的形態學和技術範疇內持續且穩健地進行泛化。通過這樣做，Atlas H&E-TME 將 H&E 幻燈片——病理學中最普遍的數據——轉變為一個可擴展的、定量的窗口，觀察腫瘤及其微環境，為轉化和臨床研究中的下一代基於組織的生物標誌物奠定基礎。

Using Explainability as a Training-Time Reliability Signal for Efficient ECG Classification

2606.12252v1 by Veerendhra Kumar Dangeti, Xiao Gu, Ying Weng, Shreyank N Gowda

Training deep neural networks for clinical time-series analysis is computationally demanding, yet many healthcare settings lack the resources required for repeated model development and deployment. This challenge is particularly evident in electrocardiogram classification, where large datasets and long training schedules make efficiency practically important. Progressive Data Dropout reduces training cost by excluding samples from gradient updates once they are learned, but it relies on model confidence and may retain samples that are difficult due to noise or ambiguity rather than useful signal. In this work, we introduce ERTS, an explainability-based reliability training signal for efficient ECG classification. ERTS uses explanation quality during training to distinguish between informative and unreliable uncertainty. Building on progressive data selection, we compute Grad-CAM attention maps for candidate samples and derive a focus score that measures whether model predictions are supported by coherent and localised patterns. Samples with low focus are filtered out, while those with meaningful attention are prioritised for gradient updates. We evaluate ERTS across three ECG datasets and multiple backbone architectures, showing consistent improvements in macro-F1 alongside reduced effective training cost. These results suggest that explanation quality can serve as a practical signal for improving both efficiency and reliability in clinical time-series learning. Code will be released.

摘要：訓練深度神經網絡以進行臨床時間序列分析在計算上要求甚高，然而許多醫療環境缺乏重複模型開發和部署所需的資源。這一挑戰在心電圖分類中特別明顯，因為大型數據集和長時間的訓練計劃使得效率變得非常重要。漸進式數據丟棄通過在樣本被學習後排除其對梯度更新的貢獻來降低訓練成本，但它依賴於模型信心，可能會保留由於噪聲或模糊而難以處理的樣本，而不是有用的信號。在這項工作中，我們介紹了ERTS，一種基於可解釋性的可靠性訓練信號，用於高效的心電圖分類。ERTS在訓練期間使用解釋質量來區分信息性和不可靠的不確定性。基於漸進式數據選擇，我們計算候選樣本的Grad-CAM注意力圖，並導出一個焦點分數，以衡量模型預測是否得到一致且局部化模式的支持。低焦點的樣本會被過濾掉，而那些具有意義的注意力的樣本則優先進行梯度更新。我們在三個心電圖數據集和多個主幹架構上評估了ERTS，顯示出宏觀F1分數的一致改善，同時有效的訓練成本降低。這些結果表明，解釋質量可以作為改善臨床時間序列學習中效率和可靠性的實用信號。代碼將會發布。

Reinforcement Learning Disrupts Gradient-Based Adversarial Optimization

2606.12251v1 by Xinhai Zou, Chang Zhao, Alireza Aghabagherloo, Dave Singelée, Robin Degraeve, Bart Preneel

Gradient-based adversarial attacks remain a dominant threat to deep neural networks (DNNs), as they exploit gradient information to efficiently optimize adversarial perturbations. To address this, we investigate whether reinforcement learning (RL) training can disrupt the gradient structure used by attackers by training image classifiers with policy-gradient objectives and epsilon-greedy exploration. Through systematic experiments across CIFAR-10, CIFAR-100, and ImageNet-100 with multiple architectures, we find that RL-trained classifiers significantly disrupt gradient-based adversarial optimization. To explain this, we conduct a comprehensive mechanism analysis using loss landscape visualization, static and dynamic gradient indicators, and predictive entropy. Our analysis reveals that RL acts as an implicit regularizer, producing models with highly unstable gradient directions and smaller gradient magnitudes. This combination makes each PGD step both unreliable in direction and limited in magnitude, causing gradient-based attacks to fail within practical iteration budgets. We further show that combining RL with adversarial training (RL-adv) provides a dual-layer defense operating at two complementary levels: RL degrades gradient information available to attackers (gradient-level defense), while adversarial training strengthens decision boundaries (boundary-level defense). RL-adv achieves the highest robustness across all major attack types evaluated, including gradient-based (PGD, AutoAttack), transfer-based, and query-based attacks, outperforming SL-adv by a significant margin. These findings identify RL-induced gradient disruption as a complementary robustness mechanism and motivate future research on hybrid SL-RL training schedules that combine SL's efficiency with RL's gradient-regularization properties.
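For readers unfamiliar with PGD, one step of the common L-infinity variant can be sketched as follows. This is a generic textbook formulation, assuming the sign-gradient update; the abstract does not state which PGD configuration the authors used, and the function name `pgd_step` is illustrative.

```python
def pgd_step(x, x_orig, grad, step_size, epsilon):
    """One L-infinity PGD step.

    Moves each coordinate by `step_size` in the direction of the gradient sign,
    then projects the result back into the epsilon-ball around `x_orig`.
    """
    def sign(g):
        return (g > 0) - (g < 0)

    stepped = [xi + step_size * sign(gi) for xi, gi in zip(x, grad)]
    return [
        min(max(si, oi - epsilon), oi + epsilon)
        for si, oi in zip(stepped, x_orig)
    ]
```

Each step depends on the gradient direction at the current point, which is why unstable gradient directions make the iterates less useful to an attacker.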


Towards Responsibly Non-Compliant Machines

2606.12147v1 by Marija Slavkovik, Marie Farrell, Louise Dennis, Michael Fisher, Simon Kolker, Emily C. Collins

We consider the problem of engineering autonomous intelligent agents that are capable of responsibly not complying with user requests. We argue that machine non-compliance comes in many different forms, and sketch the issues we should pursue on the road to accomplishing responsibly non-compliant intelligent machines. We anchor responsible non-compliance in justifications for task refusal, pathways to override the non-compliance, as well as careful tracking of security risks and liability transfers.


Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

2606.12138v1 by Gleb Gerasimov, Timofei Rusalev, Nikita Balagansky, Daniil Laptev, Vadim Kurochkin, Daniil Gavrilov

Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their utility depends on whether the learned features are reproducible across training runs. We study this question through feature stability: for each SAE feature, we estimate the probability that a similar feature reappears in an independently trained SAE. This yields a scalable per-feature signal that separates stable from unstable features. In a large-scale study across seeds, models, layers, dictionary sizes, and SAE variants, we find a pronounced functional asymmetry: stable features carry most of the reconstruction- and prediction-relevant signal, while unstable features have weak marginal impact and are dominated by low-frequency surface-form triggers in both activation statistics and automatic explanations. Geometrically, unstable features are individually non-reproducible but concentrate in reproducible lower-rank subspaces, suggesting that seed dependence often reflects basis ambiguity within a shared region of activation space rather than pure noise. A controlled synthetic model makes this mechanism explicit, showing that low-rank ground-truth features can be recovered at the subspace level while remaining non-identifiable as individual SAE latents across seeds. Finally, by pooling unique cross-seed features, we construct more stable SAEs while preserving explained variance in this setting. Together, these results show that unstable features are not merely failed or noisy latents: they have weak individual functional impact, but reflect reproducible low-dimensional structure that standard SAEs resolve differently across seeds.
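One plausible way to estimate a per-feature stability score is sketched below. It assumes that "similar" means a decoder direction whose cosine similarity exceeds a threshold, and that the probability is estimated as the fraction of other seeds containing such a match. Both choices, the threshold value, and the function names are assumptions; the abstract does not specify the matching rule.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def feature_stability(feature, other_dictionaries, threshold=0.8):
    """Fraction of independently trained dictionaries that contain a direction
    whose cosine similarity with `feature` is at least `threshold`.

    `other_dictionaries` is a list of dictionaries, one per seed, each a list
    of decoder directions. The threshold is a hypothetical choice.
    """
    hits = sum(
        1 for dictionary in other_dictionaries
        if any(cosine(feature, d) >= threshold for d in dictionary)
    )
    return hits / len(other_dictionaries)
```

A per-feature matching rule like this scores each latent individually; it does not capture whether unmatched features still lie in a shared subspace, which is the distinction the abstract draws.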


Tabular Foundation Models for Clinical Survival Analysis via Survival-Aware Adaptation

2606.12006v1 by Minh-Khoi Pham, Luca Cotugno, Alina Sirbu, Tai Tan Mai, Martin Crane, Marija Bezbradica

Predicting time-to-event outcomes such as mortality is a fundamental task in clinical decision-making, commonly addressed through survival analysis. While classical statistical and deep learning approaches have been widely studied, they typically require task-specific training and sufficient labeled data. Recent advances in tabular foundation models offer a new paradigm by learning general-purpose representations for structured data. However, their applicability to censored time-to-event prediction in clinical settings remains underexplored, as typical applications are restricted to discrete classification rather than survival analysis tasks. In this work, we propose a lightweight adaptation approach for applying tabular foundation models to clinical survival analysis by directly training a survival-aware head on top of the pretrained representations. We study representative architectures, including TabPFN, TabDPT, and TabICL, and adapt them using a multi-task logistic regression (MTLR) head to model right-censored time-to-event outcomes. We evaluate this approach on a diverse set of public survival benchmarks and two large-scale ICU cohorts, MIMIC-IV and eICU. Our results show that this transfer learning approach achieves competitive or superior performance compared to strong baselines. On MIMIC-IV, TabDPT-FT-MTLR reaches a C-index of 0.856, corresponding to a relative improvement of +1.4% over the best non-FM baseline (DeepSurv, 0.844) and +6.7% over the best zero-shot model (0.802). On eICU, TabICL-FT-MTLR achieves 0.797, yielding gains of +1.7% (DeepSurv, 0.784) and +6.4% (0.749), respectively. These findings highlight the importance of combining pretrained tabular representations with survival-aware objectives and suggest that tabular foundation models provide a practical and effective alternative for clinical survival prediction.
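The reported relative improvements follow directly from the C-index values quoted above, and the metric itself is simple to state. The sketch below recomputes the percentages and gives a minimal version of Harrell's concordance index for right-censored data; the function names are illustrative, and the abstract does not say which C-index variant or library the authors used.

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data.

    A pair (i, j) is comparable when subject i had an observed event and
    times[i] < times[j]. It is concordant when the model assigns i the
    higher risk; tied risks count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


# Figures quoted in the abstract
print(round(relative_gain(0.856, 0.844), 1))  # MIMIC-IV vs DeepSurv: 1.4
print(round(relative_gain(0.856, 0.802), 1))  # MIMIC-IV vs zero-shot: 6.7
print(round(relative_gain(0.797, 0.784), 1))  # eICU vs DeepSurv: 1.7
print(round(relative_gain(0.797, 0.749), 1))  # eICU vs zero-shot: 6.4
```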


Frozen Multimodal Embeddings for AI-Assisted Interview Assessment of Personality and Cognitive Ability

2606.11930v2 by Kuo-En Hung, Hung-Yue Suen, Shih-Ching Yeh, Hsiang-Wen Wang

Predicting psychological traits from asynchronous video interviews (AVIs) is a challenging problem in AI-assisted interview assessment because labeled datasets are limited while each response contains high-dimensional visual, acoustic, and verbal signals. This paper presents our solution for the ACM Multimedia AVI Challenge 2026, which evaluates two tasks: Track 1 predicts self-reported HEXACO personality traits from personality-related interview responses, and Track 2 classifies cognitive ability levels from structured AVI responses. We treat the problem as a small-sample representation learning task. Instead of fine-tuning large pretrained models, we use frozen multimodal encoders, including CLIP for visual features, Whisper for acoustic features and transcripts, and RoBERTa, E5, and DeBERTaV3 for textual representations, followed by low-capacity downstream models. For Track 1, our trait-specific regression and late-fusion system achieves an average validation MSE of 0.2696, improving over the official baseline of 0.3334. Ablation results show a three-step improvement from a global model (0.3189), to per-trait modeling (0.2871), to per-trait late fusion (0.2696), corresponding to a 19.1% relative MSE reduction over the official baseline. For Track 2, a compact subject-attribute baseline reaches 0.5781 accuracy, while our multimodal ensemble reaches 0.5313, both above the official baseline of 0.4062. We interpret this result as evidence of possible subject-attribute shortcuts in the validation split rather than robust cognitive inference from AVI content. Overall, our findings suggest that AVI-based psychological assessment benefits from trait-specific multimodal modeling, but cognitive ability prediction requires careful control of dataset shortcuts.
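The 19.1% figure can be checked from the two MSE values in the abstract, and late fusion in its simplest form is just an average of per-modality predictions. The sketch below shows both. Uniform or fixed weights are an assumption here; the abstract does not describe how the authors combine modality outputs.

```python
def relative_reduction(baseline, new):
    """Relative reduction of an error metric, in percent."""
    return 100.0 * (baseline - new) / baseline


def late_fusion(predictions_by_model, weights=None):
    """Weighted average of per-model predictions for the same samples.

    `predictions_by_model` is a list of equally long prediction lists, one per
    modality-specific model. Uniform weights are used when none are given.
    """
    n_models = len(predictions_by_model)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_samples = len(predictions_by_model[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions_by_model))
        for i in range(n_samples)
    ]


# Official baseline MSE 0.3334 versus final system MSE 0.2696
print(round(relative_reduction(0.3334, 0.2696), 1))  # 19.1
```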


Beyond representational alignment with brain-guided language models for robust reasoning

2606.11893v1 by Mingqing Xiao, Kai Du, Zhouchen Lin

The correspondence between large language models (LLMs) and the neural mechanisms underlying human higher-order cognition remains insufficiently characterized. Given that language and reasoning in the human brain appear dissociable, an open question is whether LLMs align with neural signals from reasoning-related regions and whether such signals can improve them. Here, focusing on deductive reasoning, we show that LLM internal representations are not only partially aligned with task-fMRI activity but can also be directly enhanced by these signals. Using a neural-predictivity metric, we find that LLMs explain a substantial fraction of the explainable variance in reasoning-related regions at the aggregate level, whereas predictivity within specific reasoning types is lower, indicating both alignment and divergence. Building on this, we propose a brain-guided framework: we steer model representations along directions induced by the joint structure of model and brain representations, applying intervention at inference and fine-tuning during training. We demonstrate that task-evoked brain signals can directly enhance LLM reasoning, yielding gains orthogonal to language-only supervision across 10 LLMs (1.5B-72B), with transfer across reasoning types and up to 13% absolute accuracy gain. Our results advance LLM-brain correspondences from correlation to guidance, establishing a brain-signal-driven pathway toward more robust and cognitively aligned AI.


Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task

2606.11830v1 by Qianyu Yao, Fei Sun, Bocheng Huang, Wei Chen, Jiarui Jiang, Shu Quan, Yifei Chen, Wenjie Xu, Bo li, Liping Su, Ruoqiong Wu, Huhai Hong, Huimei Wang

Background. Large language models and AI agents are increasingly used to support biomedical research, but native model outputs may omit key analytical steps, misuse methods, or overstate conclusions. We evaluated whether autonomous access to a medical research skill package was associated with higher-quality AI-generated transcriptomic research-analysis outputs compared with native AI without skills. Methods. We conducted an exploratory multi-model human evaluation using a non-small cell lung cancer immunotherapy biomarker task. Six model backbones were tested. The evaluation included 21 anonymized outputs: 9 native-AI outputs and 12 skill-augmented outputs generated through an AI agent implementation represented by OpenClaw. Four non-expert biomedical reviewers and two blinded experts evaluated each output, with two ratings from each reviewer type. The primary outcome was expert-rated overall quality. Results. Skill-augmented outputs showed directionally higher expert overall quality than native-AI outputs (mean 5.50 vs 5.11; difference=0.39; bootstrap 95% CI, -0.04 to 0.90; Welch p=0.156). Non-expert reviewer quality showed the same direction (mean 4.72 vs 4.47; difference=0.26; bootstrap 95% CI, -0.25 to 0.80; Welch p=0.373). Expert agreement was limited (single-rating ICC=-0.15), and model-specific effects were descriptive and heterogeneous. Conclusions. Autonomous skill access showed a directional quality signal in this exploratory sample, but the signal was smaller than expert-rating noise and should not be interpreted as confirmatory evidence. The findings primarily motivate larger evaluations of skill-augmented AI agents with stronger reliability controls, platform replication, and biological-validity assessment.
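A percentile bootstrap interval for a difference in group means, of the kind reported above, can be sketched as follows. The resampling unit (one score per output), the number of resamples, and the function name are assumptions; the abstract does not describe the exact bootstrap procedure, and no study data are reproduced here.

```python
import random


def bootstrap_diff_ci(group_a, group_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(group_a) - mean(group_b).

    Each group is resampled with replacement, independently of the other.
    Returns the (lower, upper) bounds of the 1 - alpha interval.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        resampled_a = [rng.choice(group_a) for _ in group_a]
        resampled_b = [rng.choice(group_b) for _ in group_b]
        diffs.append(
            sum(resampled_a) / len(resampled_a)
            - sum(resampled_b) / len(resampled_b)
        )
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

An interval that spans zero, as both reported intervals do, is consistent with the authors' caution that the signal is directional rather than confirmatory.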


Multimodal Ordinal Modeling of Alzheimer's Disease Severity Using Structural MRI and Clinical Data

2606.11794v1 by Boris-Stephan Rauchmann, Jonathan Laib, Buse Ercik, Robert Perneczky, Sergio Altares-López

Neurodegenerative diseases such as Alzheimer's disease (AD) require accurate and scalable tools for assessing disease severity, yet current clinical staging remains time-intensive and prone to variability. We propose an attention-enhanced multimodal machine learning framework with ordinal regression for automated and interpretable AD severity staging. The framework integrates T1-weighted MRI with demographic and genetic variables and compares unimodal and multimodal architectures using ordinal and non-ordinal prediction heads. Models were trained and validated using cohort-stratified splits derived from the ADNI, AIBL, and NIFD datasets. A strictly held-out test set was constructed using subjects excluded from all training, validation, preprocessing, and hyperparameter tuning procedures, with subject-level splitting employed throughout to prevent data leakage. Among unimodal approaches, the T1-weighted MRI model achieved slightly higher adjacent-stage accuracy (0.963) and agreement with clinical staging (QWK 0.444) than the tabular model (QWK 0.433). Integrating imaging, demographic, and genetic information improved overall performance. The multimodal non-ordinal baseline achieved the lowest prediction error (MAE 0.340), whereas the ordinal multimodal model achieved the highest adjacent-stage accuracy (0.970) and strongest agreement with clinical staging (QWK 0.549). These findings indicate that ordinal formulations better capture the ordered structure of the CDR scale and yield predictions more consistent with clinical staging. Explainability analyses using Grad CAM++ and SHAP demonstrated anatomically and clinically plausible model behavior, supporting transparent decision-making. Overall, attention-based multimodal learning with ordinal regression represents a robust, interpretable, and scalable approach for automated AD severity staging and AI-assisted clinical decision support.
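The agreement and accuracy metrics quoted above can be computed as in the sketch below. Quadratic weighted kappa (QWK) follows its standard definition; treating adjacent-stage accuracy as the share of predictions within one stage of the label is an assumption, since the abstract does not define it. Stages are assumed to be encoded as integers 0..K-1.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa with quadratic weights for ordinal labels in 0..n_classes-1."""
    n = len(y_true)
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [
        sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)
    ]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_true[i] * hist_pred[j] / n
    if weighted_expected == 0.0:
        return 1.0  # all labels and predictions fall in a single class
    return 1.0 - weighted_observed / weighted_expected


def adjacent_accuracy(y_true, y_pred):
    """Share of predictions at most one stage away from the label (assumed definition)."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

Because QWK penalises errors by squared stage distance while MAE penalises them linearly, a model can have the lowest MAE without having the highest QWK, as in the results above.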


Extracting Semantics: LLM-Guided Automatic Population of Robot Ontology from URDF

2606.17073v1 by Bastien Dussard, Guillaume Sarthou

While commonsense knowledge may suffice for virtual agents, embodied robots interacting with humans require grounded and semantically rich representations of both their environment and their own physical embodiment. In cognitive robotics, ontologies are effective for integrating such heterogeneous knowledge to enable explainable reasoning, even during continuous knowledge updates. Yet, their manual construction remains a bottleneck. We present a preliminary approach for the automatic generation of robot semantic abstractions by transforming Unified Robot Description Format (URDF) models into populated ontologies. Although URDF files provide structural and kinematic descriptions, their identifiers often require commonsense interpretation to recover meaningful semantics, a task at which Large Language Models (LLMs) excel. Our pipeline leverages LLMs to infer semantic relationships by prompting them with concepts from an existing ontology, ensuring the final classification remains aligned with the formal model. To improve reliability, the pipeline combines majority voting across multiple LLM queries along with syntactic and schema-level validation to ensure that generated outputs conform to the expected representation format and ontology constraints. We evaluate the approach on multiple robot descriptions and discuss the generated abstractions. Initial results indicate that the proposed method can effectively bridge the gap between low-level robot descriptions and the structured, grounded knowledge representations required for human-robot interaction.
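The combination of schema-level validation and majority voting described above might look like the following sketch. The function name, the strict-majority rule, and the example class names are hypothetical; the abstract does not specify the voting threshold or the ontology vocabulary.

```python
from collections import Counter


def vote_on_class(candidate_answers, allowed_classes, min_share=0.5):
    """Pick an ontology class from repeated LLM answers.

    Answers that are not in `allowed_classes` are discarded first (a stand-in
    for schema validation). The most common valid answer is accepted only if
    it accounts for more than `min_share` of all queries; otherwise None.
    """
    valid = [a.strip() for a in candidate_answers if a.strip() in allowed_classes]
    if not valid:
        return None
    label, count = Counter(valid).most_common(1)[0]
    return label if count / len(candidate_answers) > min_share else None


# Hypothetical class names for a URDF link
allowed = {"Gripper", "Wheel", "Arm"}
print(vote_on_class(["Gripper", "Gripper", "Gripper", "Wheel", "Banana"], allowed))
# Gripper
```

Returning None instead of a low-confidence label leaves room for a fallback such as manual review.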


MedCTA: A Benchmark for Clinical Tool Agents

2606.11702v1 by Tajamul Ashraf, Hyewon Jeong, Fida Mohammad Thoker, Bernard Ghanem

To make clinically grounded decisions, medical AI agents are expected to go beyond simple recognition and be capable of tool retrieval, evidence acquisition, and integration. Existing benchmarks largely evaluate isolated perception or single-turn question answering, and therefore provide limited visibility into failures of planning, tool recruitment, and rollout reliability. We introduce MedCTA, a benchmark for evaluating medical tool agents on clinician-validated, step-implicit tasks grounded in realistic multimodal clinical inputs, including radiology images, pathology slides, and reports. MedCTA comprises 107 real-world clinical tasks with clinician-verified executable trajectories over 5 deployed tools, and supports process-aware evaluation of tool selection, argument validity, execution stability, trajectory fidelity, and outcome quality. We benchmark 18 open- and closed-source multimodal models and find that even frontier systems remain brittle in multi-step clinical tool use: autonomous rollouts are dominated by protocol failures, premature stopping, and incorrect tool recruitment, while gold-standard tool routing yields large but still incomplete gains. These results show that strong backbone perception does not translate into reliable agentic behavior in clinical settings. MedCTA provides a rigorous testbed for auditing, diagnosing, and advancing trustworthy medical AI agents. The dataset and evaluation suite are available at https://ivul-kaust.github.io/MedCTA/


Quickest Detection of Hallucination Onset: Delay Bounds and Learned CUSUM Statistics

2606.12476v2 by Igor Itkin

Token-level hallucination detectors are evaluated as classifiers, by AUC over all tokens, yet a streaming monitor is judged by its reaction time: the number of tokens that pass between the onset of a hallucination and the alarm. We formulate hallucination onset detection as a quickest change detection problem. A first-order Markov model of the latent faithful/hallucinated state, validated on RAGTruth, places the task inside classical change-point theory and yields Lorden's lower bound on detection delay: about 1.3 tokens at a false-alarm rate of 0.01. We then show that a causal recurrent labeler acts as a CUSUM with a learned increment. Among the onsets it catches it detects in 11-13 tokens, against 31 for a linear per-token baseline, though at this false-alarm budget every detector catches under a third of onsets and the recall-honest delay is 56-66 tokens: low-false-alarm onset detection is hard. A controlled decomposition attributes the speed advantage mostly to a better per-token score rather than to temporal accumulation. An information-rate optimality theorem of Donsker-Varadhan type explains the remaining order-of-magnitude gap: the learned score realizes only 1/4.5 of the divergence the features carry, a deficit that recalibration cannot remove, with the remainder a finite-horizon effect. Classification metrics conceal this delay structure; sequential analysis makes it measurable.
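The classical CUSUM recursion that the abstract refers to can be sketched in a few lines. The per-token increments would come from a log-likelihood ratio or, as in the paper, a learned score; here they are simply passed in, and the function name and threshold are illustrative.

```python
def cusum_alarm(increments, threshold):
    """Run a one-sided CUSUM over per-token increments.

    The statistic follows S_t = max(0, S_{t-1} + increment_t) and an alarm is
    raised at the first index where S_t >= threshold. Returns that index, or
    None if the threshold is never reached.
    """
    stat = 0.0
    for t, increment in enumerate(increments):
        stat = max(0.0, stat + increment)
        if stat >= threshold:
            return t
    return None
```

Given an onset at token index `k`, the detection delay is the alarm index minus `k`. Raising the threshold lowers the false-alarm rate at the cost of a longer delay, which is the trade-off the delay bound quantifies.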


Can AI Agents Synthesize Scientific Conclusions?

2606.11337v1 by Hayoung Jung, Pedro Viana Diniz, José Reinaldo Corrêa Roveda, Abner Fernandes da Silva, Haeun Jung, Enoch Tsai, Aleksandra Korolova, Manoel Horta Ribeiro

Scientific AI agents increasingly retrieve evidence, reason across sources, and synthesize conclusions used in consequential decisions. Yet, their ability to do so in high-stakes domains such as health remains unclear. We introduce SciConBench, a large-scale live benchmark of 9.11K questions and expert-written conclusions from systematic reviews to evaluate open-domain scientific conclusion synthesis. The benchmark draws on an expert-validated automated evaluation pipeline that decomposes conclusions into atomic facts and measures correctness and comprehensiveness via factual precision and recall. To mitigate data leakage, we further introduce SciConHarness, a clean-room evaluation harness that equips agents with controlled web interaction to ensure valid measurement. Evaluating 8 frontier models and deep research agents, we find that factual quality remains low: under clean-room settings, the best agent achieves only a factual F1 of 0.337. Our clean-room setting consistently reduces performance relative to unconstrained evaluation, suggesting that leakage inflates estimates of models' true synthesis capabilities. Finally, we audit consumer-facing agents (e.g., Google AI Overview, OpenEvidence) and find they frequently generate incomplete and sometimes contradictory conclusions, even when the ground-truth answer is available. Overall, our results show that reliable synthesis of scientific conclusions remains an open challenge, and that clean-room evaluation is essential for assessing open-domain AI agents.
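Once a conclusion has been decomposed into atomic facts and each fact has been judged, factual precision, recall, and F1 reduce to simple ratios. The sketch below assumes the judging step has already produced counts; the function and parameter names are illustrative, and the paper's pipeline may aggregate differently.

```python
def factual_scores(n_generated, n_generated_supported, n_reference, n_reference_covered):
    """Factual precision, recall, and F1 from atomic-fact counts.

    precision: share of generated facts supported by the expert conclusion
    recall:    share of expert-conclusion facts covered by the generated text
    """
    precision = n_generated_supported / n_generated if n_generated else 0.0
    recall = n_reference_covered / n_reference if n_reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, 2 supported facts out of 4 generated and 1 covered fact out of 4 reference facts give a precision of 0.5, a recall of 0.25, and an F1 of 1/3.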


Designed by Journalists, but Is It for Readers? Rethinking AI Disclosures and Transparency in News

2606.11116v1 by Pooja Prajod

As newsrooms integrate generative AI, journalists face a disclosure challenge: how to communicate AI involvement in ways that maintain reader trust. Current practice offers two approaches: brief one-line labels or detailed disclosures specifying human oversight, editorial accountability, and error reporting mechanisms. Neither achieves journalists' goal of building trust through transparency. An existing controlled experiment with 34 news readers shows that detailed disclosures trigger a "transparency dilemma", reducing trust rather than increasing it, and risk introducing dark patterns that readers scroll past with the illusion of transparency. One-line disclosures avoid this effect but can create an information gap, prompting readers to expend cognitive effort searching for signs of AI involvement that the disclosure indicates but does not explain. Yet readers are not rejecting transparency: they proposed disclosure designs centered on user agency, including detail-on-demand interactions, proportional AI-ratio visualizations, outlet-level signals, and explicit "no AI" labels. I argue that this disconnect between what practitioners believe is responsible disclosure and what users actually need is a design problem for the HCI community.


FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model

2606.11106v1 by Mahmood Alzubaidi, Uzair Shah, Raden Muaz, Ines Abbes, Nader Mohammed, Abdullatif Magram, Khalid Alyafei, Mowafa Househ, Marco Agus

A global shortage of trained sonographers limits prenatal ultrasound screening in low- and middle-income countries, where over half of pregnant women receive no skilled sonography. Current deep learning approaches address detection, segmentation, or classification in isolation, each demanding a separate model and expert-specified labels at inference. We present FADA, a unified vision-language model built on Qwen3.5-VL that performs clinical interpretation, classification, detection, and segmentation through a single interpretation-first pipeline without external labels. FADA distills knowledge from four domain-specific foundation models (FetalCLIP, UltraSAM, USF-MAE, UltraFedFM) via offline pre-computed feature caching. Selective distillation, which applies feature alignment only to annotation tasks while interpretation relies on standard fine-tuning, consistently outperforms full distillation across most evaluation axes. The recommended variant, FADA-SKD, achieves 0.8820 mean Dice for segmentation, 0.7671 mAP@0.50 for detection, and 100% structured interpretation compliance. Expert sonographer validation across 237 images confirms clinically acceptable outputs in both autonomous and human-in-the-loop modes, with 73.5% of interpretations scoring perfectly under clinician guidance. The system is trainable on a single consumer GPU and deployable without cloud connectivity. We validate edge deployment by running the compressed 0.8B model on a commodity smartphone (Qualcomm Snapdragon 7 Gen 1, 12 GB RAM) using llama.cpp with GGUF quantization, completing the full 5-phase pipeline in approximately 60 seconds entirely offline. This establishes a practical pathway for integrating AI-assisted fetal assessment with portable ultrasound devices, directly addressing diagnostic access gaps in resource-constrained settings. Code, models, and data are available at https://github.com/mahmoodphd/FADA.
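The two headline metrics above rest on overlap measures that are easy to state. The sketch below computes the Dice coefficient for flattened binary masks and the intersection-over-union (IoU) of two boxes, which is the quantity thresholded at 0.50 in mAP@0.50. The function names and the `(x1, y1, x2, y2)` box convention are assumptions for illustration.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice = 2 * |A and B| / (|A| + |B|) for flattened binary masks."""
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    size_a = sum(1 for a in mask_a if a)
    size_b = sum(1 for b in mask_b if b)
    if size_a + size_b == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * intersection / (size_a + size_b)


def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0
```

Under mAP@0.50, a predicted box counts as a match only when its IoU with a ground-truth box is at least 0.5.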


Superficial Beliefs in LLM Decision-Making

2606.11016v1 by Gabriel Freedman, Francesca Toni

We ask whether large language models (LLMs) merely imitate rationales when choosing between two options, or whether their choices reflect a systematic underlying decision structure. Using synthetic binary decision settings in which models choose between profiles defined by graded attributes, we compare the attribute a model says mattered most with the attribute that best explains its choice under a behavioural model fit to prior decisions. The behavioural model predicts held-out choices well, showing that model behaviour is systematically related to the visible attributes rather than being random. However, direct self-reports and a separate score-based judge recover the behaviourally inferred driver only partially. The resulting picture is neither one of arbitrary behaviour nor one of fully articulated belief - outputs are structured enough to support prediction, but explicit reasons track the recovered driver only imperfectly. This qualitative pattern persists across prompt-order and sampling perturbations, alternative behavioural models, targeted occlusion analyses, and structurally varied decision settings. We interpret this as evidence for "superficial belief" in LLM decision-making: models behave as if guided by probabilistic local priorities over attributes, while having only limited verbal access to the attributes that drive their decisions.

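Abstract: We ask whether large language models (LLMs) merely imitate rationales when choosing between two options, or whether their choices reflect a systematic decision structure. Using synthetic binary decision settings in which models choose between profiles defined by graded attributes, we compare the attribute a model says mattered most with the attribute that best explains its choice under a behavioural model fit to prior decisions. The behavioural model predicts held-out choices well, showing that model behaviour is systematically related to the visible attributes rather than random. However, direct self-reports and a separate score-based judge only partially recover the behaviourally inferred driver. The resulting picture is neither arbitrary behaviour nor fully articulated belief: outputs are structured enough to support prediction, but explicit reasons track the recovered driver only imperfectly. This qualitative pattern persists across prompt-order and sampling perturbations, alternative behavioural models, targeted occlusion analyses, and structurally varied decision settings. We interpret this as evidence of "superficial belief" in LLM decision-making: models behave as if guided by probabilistic local priorities over attributes, while having only limited verbal access to the attributes that drive their decisions.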

Understanding and mitigating the risks of OpenClaw for non-technical users: A practical guide with Skill

2606.11007v1 by Junchang Zheng, Junfeng Tan, Jialiang Lin

OpenClaw has rapidly emerged as a transformative artificial intelligence (AI) agent framework, and its ability to autonomously execute complex, multi-step tasks has attracted an ever-growing and diverse user base. However, this capability comes with significant risks. While existing research has made important strides in characterizing these threats, such work is predominantly directed at technically sophisticated audiences. It remains largely inaccessible to non-technical users. This demographic now makes up an increasingly large and underserved portion of the community, yet it is these very users who most urgently need practical and straightforward guidance. In response, we bridge this gap through a series of interconnected efforts designed to lower the risk barrier for non-technical OpenClaw users. First, we identify and categorize seven core risks that OpenClaw users may encounter in daily usage, explaining each in plain language so that non-technical users can readily grasp the nature and potential consequences of these threats. Second, for each identified risk, we distill a set of corresponding defensive strategies into clear and actionable operational steps that are easy to follow. Third, to make protection even easier, we provide a companion OpenClaw Skill that automates key security configurations, enabling users to safeguard their systems with minimal manual intervention. Through this work, we demonstrate that safeguarding against the risks of intelligent agents need not be the exclusive domain of security experts, and that non-technical users can meaningfully participate in reducing these risks through simple, practical actions.

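Abstract: OpenClaw has rapidly emerged as a transformative artificial intelligence (AI) agent framework, and its ability to autonomously execute complex, multi-step tasks has attracted an increasingly diverse user base. However, this capability comes with significant risks. Although existing research has made important progress in characterizing these threats, that work is aimed mainly at technically sophisticated audiences and remains largely inaccessible to non-technical users. This group now makes up a growing and underserved part of the community, yet these are the users who most urgently need practical, straightforward guidance. In response, we bridge this gap through a series of interconnected efforts designed to lower the risk barrier for non-technical OpenClaw users. First, we identify and categorize seven core risks that OpenClaw users may encounter in daily use, and explain each in plain language so that non-technical users can readily grasp the nature and potential consequences of these threats. Second, for each identified risk, we distill a set of corresponding defensive strategies into clear, actionable steps that are easy to follow. Third, to make protection even simpler, we provide an OpenClaw Skill that automates key security configurations, enabling users to protect their systems with minimal manual intervention. Through this work, we show that guarding against the risks of intelligent agents need not be the exclusive domain of security experts, and that non-technical users can meaningfully help reduce these risks through simple, practical actions.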

Generative Explainability for Next-Generation Networks: LLM-Augmented XAI with Mutual Feature Interactions

2606.10942v1 by Kiarash Rezaei, Omran Ayoub, Sebastian Troia, Francesco Lelli, Paolo Monti, Carlos Natalino

As artificial intelligence and machine learning (AI/ML) models become integral to network operations, their lack of transparency poses a significant barrier to operator trust. Existing explainable artificial intelligence (XAI) techniques often fail to bridge this gap for non-specialists, producing technical outputs that are difficult to translate into actionable insights. This paper presents a framework specifically designed to address this shortcoming. It leverages a moderately sized large language model (LLM) and extends beyond the standard use of SHapley Additive exPlanations (SHAP) feature influence values. The framework employs a structured prompt enriched with mutual feature interaction data to generate human-understandable natural language explanations. To validate our framework, we performed an empirical evaluation on an optical quality of transmission (QoT) estimation use case with human evaluators. We collected independent performance evaluations from specialists, which showed a high inter-evaluator agreement. Compared to a state-of-the-art baseline that uses only SHAP feature influence values in a straightforward prompt, our approach improves the explanation usefulness and scope by 12.2% and 6.2%, while achieving 97.5% correctness.

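Abstract: As artificial intelligence and machine learning (AI/ML) models become integral to network operations, their lack of transparency poses a significant barrier to operator trust. Existing explainable artificial intelligence (XAI) techniques often fail to bridge this gap for non-specialists, producing technical outputs that are difficult to translate into actionable insights. This paper presents a framework designed specifically to address this shortcoming. It uses a moderately sized large language model (LLM) and goes beyond the standard use of SHapley Additive exPlanations (SHAP) feature influence values. The framework uses a structured prompt enriched with mutual feature interaction data to generate human-understandable natural language explanations. To validate the framework, we carried out an empirical evaluation with human evaluators on an optical quality of transmission (QoT) estimation use case. We collected independent performance evaluations from specialists, which showed high inter-evaluator agreement. Compared with a state-of-the-art baseline that uses only SHAP feature influence values, our approach improves explanation usefulness and scope by 12.2% and 6.2%, respectively, while achieving 97.5% correctness.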

What Do Deepfake Speech Detectors Actually Hear?

2606.10912v1 by Vojtěch Staněk, Veronika Jirmusová, Anton Firc, Kamil Malinka, Jakub Reš, Martin Perešíni

Deepfake speech detectors often output a single score without explaining why an audio sample is flagged, where in the signal the evidence lies, or what cues drive the decision. We propose an audio-native explainability pipeline using Integrated Gradients on time-aligned self-supervised representations to localize decision evidence over time. We apply the proposed method to three WavLM-based detectors (AASIST, CA-MHFA, SLS) on ASVspoof 5 and manually annotate the highest-attribution regions to provide a semantic meaning of the most important cues. Despite similar performance, the detectors rely on different cues: AASIST emphasizes non-speech/environment cues, CA-MHFA focuses on localized phoneme artifacts, and SLS relies on word boundaries and spectral integrity. We move beyond speculative reasoning and validate our findings by causal masking of the primary detector cues. Observed performance degradation further supports the explained detector semantics.

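Abstract: Deepfake speech detectors usually output a single score without explaining why an audio sample is flagged, where in the signal the evidence lies, or which cues drive the decision. We propose an audio-native explainability pipeline that applies Integrated Gradients to time-aligned self-supervised representations in order to localize decision evidence over time. We apply the proposed method to three WavLM-based detectors (AASIST, CA-MHFA, SLS) on ASVspoof 5 and manually annotate the highest-attribution regions to give the most important cues a semantic meaning. Despite similar performance, the detectors rely on different cues: AASIST emphasizes non-speech/environment cues, CA-MHFA focuses on localized phoneme artifacts, and SLS relies on word boundaries and spectral integrity. We go beyond speculative reasoning and validate our findings by causally masking the primary detector cues. The observed performance degradation further supports the explained detector semantics.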
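
The attribution step named in this abstract, Integrated Gradients, can be illustrated on a toy model. The sketch below is not the authors' pipeline: the four-value input, the sigmoid "detector", and its weights are all hypothetical stand-ins for time-aligned frame features, and the path integral is approximated with a midpoint Riemann sum.

```python
import math

# Hypothetical weights of a toy detector; one weight per time frame.
WEIGHTS = [0.5, -1.0, 2.0, 0.0]

def model(x):
    """Toy differentiable detector: sigmoid of a weighted sum of frame features."""
    z = sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1.0 / (1.0 + math.exp(-z))

def grad(x):
    """Analytic gradient of the toy model with respect to each input frame."""
    p = model(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]

def integrated_gradients(x, baseline, steps=200):
    """Midpoint-rule approximation of Integrated Gradients along a straight path."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad(point))]
    # Scale the averaged gradient by the input-minus-baseline difference.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [1.0, 0.5, 1.5, 3.0]
baseline = [0.0, 0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)
```

The attributions sum (up to discretization error) to `model(x) - model(baseline)`, which is the completeness property that makes the per-frame scores interpretable as shares of the decision. The frame with zero weight receives zero attribution even though its input value is the largest.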

Accelerating NeurASP with vectorization and caching

2606.10787v1 by Alexander Philipp Rader, Alessandra Russo

Neurosymbolic AI combines neural networks with symbolic programs to create robust and explainable predictions. One such framework is NeurASP, which trains a neural network to predict concepts and reasons over them using rules written in answer set programming (ASP) to solve downstream tasks. Crucially, labels are only provided for the downstream prediction produced by the symbolic rules, not for the latent concepts themselves. Backpropagation through the non-differentiable ASP component requires expensive probability and gradient calculations, which has hindered scalability to more sophisticated tasks. In this paper, we address the current limitations of NeurASP by improving its computational performance through vectorization, batch processing and caching of intermediate computations during training. We compare computation speeds between the original and our new implementation of NeurASP and report speedups of multiple orders of magnitude for larger tasks. To this end, we propose a new dataset of difficult tasks involving playing cards, which we use to test the capabilities of NeurASP's enhanced learning function.

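Abstract: Neurosymbolic AI combines neural networks with symbolic programs to produce robust and explainable predictions. One such framework is NeurASP, which trains a neural network to predict concepts and reasons over those concepts, using rules written in answer set programming (ASP) to solve downstream tasks. Crucially, labels are provided only for the downstream prediction produced by the symbolic rules, not for the latent concepts themselves. Backpropagation through the non-differentiable ASP component requires expensive probability and gradient calculations, which has hindered scaling to more complex tasks. In this paper, we address the current limitations of NeurASP by improving its computational performance through vectorization, batch processing, and caching of intermediate computations during training. We compare computation speeds between the original NeurASP and our new implementation and report speedups of multiple orders of magnitude on larger tasks. To this end, we propose a new dataset of difficult tasks involving playing cards and use it to test the capabilities of NeurASP's enhanced learning function.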
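
To make the cost being described concrete, the sketch below computes the probability of a downstream label by summing over latent concept assignments that satisfy a symbolic rule. It is only an illustration under stated assumptions: the rule `a + b == target` is a plain Python predicate standing in for an ASP program, the function names are hypothetical, and nothing here reflects the actual NeurASP code.

```python
from functools import lru_cache
from itertools import product

def label_prob_naive(p_a, p_b, target):
    """Enumerate every (a, b) assignment and keep those satisfying the rule."""
    return sum(p_a[a] * p_b[b]
               for a, b in product(range(len(p_a)), range(len(p_b)))
               if a + b == target)

@lru_cache(maxsize=None)
def satisfying_pairs(n_a, n_b, target):
    """Cache the satisfying assignments; they do not depend on network outputs."""
    return tuple((a, b) for a in range(n_a) for b in range(n_b) if a + b == target)

def label_prob_cached(p_a, p_b, target):
    """Reuse the cached assignments, so only the products are recomputed."""
    return sum(p_a[a] * p_b[b]
               for a, b in satisfying_pairs(len(p_a), len(p_b), target))

def all_label_probs(p_a, p_b):
    """Compute every label at once (a discrete convolution of the two outputs)."""
    out = [0.0] * (len(p_a) + len(p_b) - 1)
    for a, pa in enumerate(p_a):
        for b, pb in enumerate(p_b):
            out[a + b] += pa * pb
    return out

# Hypothetical softmax outputs of two concept classifiers.
p_a = [0.1, 0.7, 0.2]
p_b = [0.5, 0.5, 0.0]
probs = all_label_probs(p_a, p_b)
```

The cached variant separates the symbolic work (which assignments satisfy the rule) from the numeric work (their probabilities), so the former is paid once per rule rather than once per training step. The all-labels variant shows the batching idea: one pass yields the full label distribution.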

From Data Heterogeneity to Convergence: A Data-Centric Review of Federated Learning

2606.10595v1 by Huong Nguyen, Mickaël Bettinelli, Amirhossein Ghaffari, Alexandre Benoit, Hong-Tri Nguyen, Susanna Pirttikangas, Lauri Lovén

Federated Learning (FL) has emerged as a promising solution to the data hunger of centralized learning. The paradigm preserves privacy by letting multiple clients collaboratively train a shared-task model without exposing their local data. While data is a key component of any learning system, it is also a primary source of vulnerabilities and challenges, and a major determinant of stable, well-converged training. Existing FL reviews describe general foundations, security practices, opportunities, challenges, and applications, but do not examine the diverse aspects of data or consider problems from the data perspective. They rarely provide a data-lens synthesis that links concrete data properties, split protocols, and defenses to convergence speed and stability. This survey fills that gap with three advances. First, we decompose non-IID data into measurable traits and rank their influence on convergence as strong, medium, or light, explaining the mechanisms behind each and reconciling evidence across images, texts, and graphs. Second, we connect experimental splitting practices to the real phenomena they emulate, expose the artifacts they introduce, and show how those artifacts affect target accuracy. Third, we analyze how data-related vulnerabilities and their proposed defenses affect convergence, reporting performance under clean and adversarial conditions to make the convergence-robustness trade-off explicit. To our knowledge, this is the first survey to provide a complete understanding of the data-related challenges that govern FL. With clear takeaways distilled for each concern, our work serves as actionable guidance, helping practitioners design their systems with predictable convergence and stability.

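Abstract: Federated Learning (FL) has become a promising solution to the data hunger of centralized learning. The paradigm lets multiple clients jointly train a shared-task model without exposing their local data, thereby protecting privacy. Data is a key component of any learning system, but it is also a primary source of vulnerabilities and challenges and a major determinant of stable, well-converged training. Existing FL reviews describe general foundations, security practices, opportunities, challenges, and applications, but do not explore the diversity of data or consider problems from the data perspective. They rarely offer a data-lens synthesis that links concrete data properties, split protocols, and defenses to convergence speed and stability. This survey fills that gap with three advances. First, we break non-IID data down into measurable traits and rank their influence on convergence as strong, medium, or light, explaining the mechanism behind each and reconciling evidence from images, texts, and graphs. Second, we connect experimental splitting practices to the real phenomena they emulate, expose the artifacts they introduce, and show how those artifacts affect target accuracy. Third, we analyze how data-related vulnerabilities and their proposed defenses affect convergence, reporting performance under clean and adversarial conditions to make the trade-off between convergence and robustness explicit. To our knowledge, this is the first survey to provide a complete understanding of the data-related challenges that govern FL. Our work distills clear takeaways for each concern and serves as actionable guidance, helping practitioners design systems with predictable convergence and stability.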
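
As an illustration of the kind of experimental splitting practice the survey discusses, the sketch below partitions a labelled dataset across clients with per-class Dirichlet proportions, a widely used way to emulate label skew. The abstract does not say which protocols the survey covers, so the choice of a Dirichlet split, the function name, and the parameter values are assumptions.

```python
import random

def dirichlet_label_split(labels, n_clients, alpha, seed=0):
    """Assign sample indices to clients using Dirichlet(alpha) shares per class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    clients = [[] for _ in range(n_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # A Dirichlet draw is a vector of Gamma draws normalised to sum to 1.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(n_clients)]
        total = sum(gammas)
        start, cumulative = 0, 0.0
        for c in range(n_clients):
            cumulative += gammas[c] / total
            # The last client takes the remainder so no sample is dropped.
            end = len(idxs) if c == n_clients - 1 else round(cumulative * len(idxs))
            clients[c].extend(idxs[start:end])
            start = end
    return clients

# Hypothetical dataset: 200 samples, 4 balanced classes, 5 clients.
labels = [i % 4 for i in range(200)]
clients = dirichlet_label_split(labels, n_clients=5, alpha=0.5)
```

Smaller `alpha` values concentrate each class on fewer clients, which is the knob such protocols use to control how non-IID the simulated federation is. Whether that synthetic skew resembles a real deployment is exactly the kind of artifact question the abstract raises.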

Towards Critical Branching Mechanism in Recurrent Neural Networks

2606.10384v1 by Feixiang Ren, Ling Feng

Criticality has been proposed as a key organizing principle in biological neural systems, yet its origin and relevance in artificial neural networks remain unclear. We analyze hidden-state dynamics in trained long short-term memory (LSTM) networks and show that small networks near their optimal training epochs (steps) exhibit scale-free avalanche statistics and branching parameters close to unity, indicative of near-critical dynamics, while larger models remain subcritical. To explain the coexistence of subcritical branching with robust $1/f^β$ noise, we introduce a mixture branching process framework that links heterogeneous branching dynamics to long-range temporal correlations. These results identify critical-like behavior in LSTMs as an emergent, capacity-dependent dynamical regime.

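Abstract: Criticality has been proposed as a key organizing principle in biological neural systems, but its origin and relevance in artificial neural networks remain unclear. We analyze hidden-state dynamics in trained long short-term memory (LSTM) networks and show that small networks near their optimal training epochs (steps) exhibit scale-free avalanche statistics and branching parameters close to 1, indicating near-critical dynamics, while larger models remain subcritical. To explain the coexistence of subcritical branching with robust $1/f^β$ noise, we introduce a mixture branching process framework that links heterogeneous branching dynamics to long-range temporal correlations. These results identify critical-like behavior in LSTMs as an emergent, capacity-dependent dynamical regime.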
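
The two quantities named in this abstract can be sketched for a generic activity time series. This is a minimal illustration using the conventional estimator (mean ratio of activity in consecutive time bins) and the conventional avalanche definition (a run of non-silent bins); how the authors derive activity counts from LSTM hidden states is not stated in the abstract, so the integer series below is hypothetical.

```python
def branching_parameter(activity):
    """Mean ratio of next-bin to current-bin activity over non-silent bins."""
    ratios = [nxt / cur for cur, nxt in zip(activity, activity[1:]) if cur > 0]
    if not ratios:
        raise ValueError("no time bin with non-zero activity")
    return sum(ratios) / len(ratios)

def avalanche_sizes(activity):
    """Split the series at silent bins and return the total activity per avalanche."""
    sizes, current = [], 0
    for a in activity:
        if a > 0:
            current += a
        elif current > 0:
            sizes.append(current)
            current = 0
    if current > 0:
        sizes.append(current)
    return sizes

# Hypothetical number of active units per time bin.
activity = [0, 1, 2, 2, 1, 0, 0, 3, 3, 0]
sigma = branching_parameter(activity)
sizes = avalanche_sizes(activity)
```

A value of `sigma` near 1 is the signature of critical branching, values below 1 indicate subcritical dynamics in which activity dies out, and the distribution of `sizes` is what would be tested for scale-free behaviour.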

Supervised Fine-tuning with Synthetic Rationale Data Hurts Real-World Disease Prediction

2606.10279v1 by Buxin Su, Bingxuan Li, Cheng Qian, Yiwei Wang, Jin Jin, Bingxin Zhao

Supervised fine-tuning with synthetic rationale data is widely assumed to improve language model performance on clinical prediction tasks by teaching models not just what to predict but why. We test this assumption on five-year Alzheimer's disease and related dementias (ADRD) prediction from longitudinal health histories. Across a large-scale controlled experiment of 504 configurations, we find that rationale-based SFT consistently and substantially hurts prediction performance relative to label-only fine-tuning. The degradation persists across model families and data scales, and is not resolved by using a reasoning-oriented base model. Crucially, the failure is not explained by poor rationale quality: human expert annotation confirms that the generated rationales are medically accurate and faithfully grounded in patient-specific evidence, and few-shot experiments show that the same rationales improve performance when used as inference-time demonstrations rather than training targets. We identify the root cause as a structural conflict between narrative plausibility and discriminative optimization. We hope our work paves the path toward a more precise understanding of when and how rationale-based supervision helps and when it does not, guiding the responsible development of language models for high-stakes clinical prediction.

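Abstract: Supervised fine-tuning with synthetic rationale data is widely believed to improve language model performance on clinical prediction tasks, because it teaches models not only what to predict but also why. We test this assumption on five-year prediction of Alzheimer's disease and related dementias (ADRD) from longitudinal health histories. In a large-scale controlled experiment covering 504 configurations, we find that rationale-based SFT consistently and substantially hurts prediction performance relative to label-only fine-tuning. The degradation persists across model families and data scales, and using a reasoning-oriented base model does not resolve it. Crucially, the failure is not caused by poor rationale quality: human expert annotation confirms that the generated rationales are medically accurate and faithfully grounded in patient-specific evidence, and few-shot experiments show that the same rationales improve performance when used as inference-time demonstrations rather than training targets. We identify the root cause as a structural conflict between narrative plausibility and discriminative optimization. We hope our work paves the way toward a more precise understanding of when and how rationale-based supervision helps and when it does not, guiding the responsible development of language models for high-stakes clinical prediction.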

Gaming AI-Assisted Peer Reviews Poses New Risks to the Scientific Community

2606.10159v1 by Lin Li, Qi Zhang, Xander Davies, Jianing Qiu, Yarin Gal

AI is increasingly used to support scientific peer review, from manuscript screening and reviewer assistance to editorial triage. Although such systems promise to reduce reviewer burden and accelerate publication, their robustness to strategic manipulation remains poorly understood. Here we show that AI-mediated peer review is vulnerable to a simple, low-cost manipulation: superficial rephrasing of the manuscript abstract. Without changing the underlying scientific content and communication, and even without knowledge of the reviewing model, adversarially rewritten abstracts substantially improve AI review outcomes. We see this across disciplines and publication venues, for both human-written and AI-generated papers. Our strongest attack achieves an attack success rate of about 38%, increasing acceptance ratings by +1.31 for Gemini 3 Flash reviewers and by +0.88 for GPT 5.4 Mini reviewers on a 10-point scale. When the original AI review suggests 'reject', the success rate rises to more than 50%. This effect extends beyond overall score inflation, increasing review confidence and scores on core scientific criteria such as soundness, significance and perceived contribution. The attack is practical, requiring only about 5 minutes and $1 for a 10-page AI conference submission, and is hard to distinguish from ordinary scientific editing. Inflated AI reviews could bias downstream human decision-making, shifting editorial recommendations from rejection towards acceptance. These findings reveal a general vulnerability in AI-assisted scientific evaluation: when AI-generated reviews influence editorial decisions, authors may be incentivized to optimize manuscripts for AI judgment rather than scientific merit. Our results suggest that AI tools should not be treated as neutral evaluators in high-stakes peer review without systematic robustness testing, transparent safeguards and careful human oversight.

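Abstract: AI is increasingly used to support scientific peer review, from manuscript screening and reviewer assistance to editorial triage. Although these systems promise to reduce reviewer burden and speed up publication, their robustness to strategic manipulation remains poorly understood. Here we show that AI-mediated peer review is vulnerable to a simple, low-cost manipulation: superficial rephrasing of the manuscript abstract. Without changing the underlying scientific content and communication, and even without knowledge of the reviewing model, adversarially rewritten abstracts substantially improve AI review outcomes. We observe this across disciplines and publication venues, for both human-written and AI-generated papers. Our strongest attack achieves an attack success rate of about 38%, raising acceptance ratings by +1.31 for Gemini 3 Flash reviewers and by +0.88 for GPT 5.4 Mini reviewers on a 10-point scale. When the original AI review recommends "reject", the success rate rises above 50%. The effect goes beyond overall score inflation, increasing review confidence and scores on core scientific criteria such as soundness, significance, and perceived contribution. The attack is practical, requiring only about 5 minutes and 1 US dollar for a 10-page AI conference submission, and it is hard to distinguish from ordinary scientific editing. Inflated AI reviews could bias downstream human decision-making, shifting editorial recommendations from rejection toward acceptance. These findings reveal a general vulnerability in AI-assisted scientific evaluation: when AI-generated reviews influence editorial decisions, authors may be incentivized to optimize manuscripts for AI judgment rather than scientific merit. Our results suggest that AI tools should not be treated as neutral evaluators in high-stakes peer review without systematic robustness testing, transparent safeguards, and careful human oversight.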

XMedFusion: A Knowledge-Guided Multimodal Perception and Reasoning Framework for Autonomous Medical Systems

2606.14766v1 by Hamza Riaz, Arham Haroon, Maha Baig, Muhammad Dawood Rizwan, Muhammad Naseer Bajwa, Muhammad Moazam Fraz

Autonomous medical and robotic systems increasingly rely on intelligent perception and reasoning capabilities to interpret visual data and support clinical decision making. Radiology report generation represents a critical component of such automated diagnostic workflows, yet existing end-to-end multimodal models often suffer from weak visual grounding, resulting in unreliable interpretations and omission of subtle clinical findings. This paper presents XMedFusion, a modular AI framework designed as an intelligent perception and reasoning module for autonomous medical systems. The proposed framework decomposes visual information into coordinated functional components that emulate expert-driven analysis, including a visual perception agent that extracts image-grounded evidence, a knowledge graph construction agent that structures clinically relevant findings, and a retrieval-guided drafting process that ensures a consistent reporting structure. A synthesis agent iteratively integrates visual and structured evidence through reasoning-driven verification to produce reliable and interpretable diagnostic outputs. Experimental evaluation on a public chest radiograph dataset demonstrates significant improvements over baseline vision-language models, achieving gains from 0.0493 to 0.3359 in BLEU-1, 0.0863 to 0.2440 in ROUGE-L, and 0.0829 to 0.1708 in METEOR, along with substantial improvements in semantic evaluation metrics such as Consistency (2.38 to 7.80) and Accuracy (2.34 to 6.93). The results highlight the effectiveness of structured multi-agent perception and reasoning for enhancing robustness, transparency, and automation in intelligent medical imaging systems, enabling integration into autonomous healthcare and robotic diagnostic workflows.

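Abstract: Autonomous medical and robotic systems increasingly rely on intelligent perception and reasoning capabilities to interpret visual data and support clinical decision-making. Radiology report generation is a critical component of such automated diagnostic workflows, but existing end-to-end multimodal models often suffer from weak visual grounding, leading to unreliable interpretations and the omission of subtle clinical findings. This paper presents XMedFusion, a modular AI framework designed to serve as an intelligent perception and reasoning module for autonomous medical systems. The proposed framework decomposes visual information into coordinated functional components that emulate expert-driven analysis, including a visual perception agent that extracts image-grounded evidence, a knowledge graph construction agent that structures clinically relevant findings, and a retrieval-guided drafting process that ensures a consistent reporting structure. A synthesis agent iteratively integrates visual and structured evidence through reasoning-driven verification to produce reliable and interpretable diagnostic outputs. Experimental evaluation on a public chest radiograph dataset shows significant improvements over baseline vision-language models, rising from 0.0493 to 0.3359 in BLEU-1, from 0.0863 to 0.2440 in ROUGE-L, and from 0.0829 to 0.1708 in METEOR, along with substantial improvements in semantic evaluation metrics such as Consistency (2.38 to 7.80) and Accuracy (2.34 to 6.93). The results highlight the effectiveness of structured multi-agent perception and reasoning for enhancing the robustness, transparency, and automation of intelligent medical imaging systems, enabling integration into autonomous healthcare and robotic diagnostic workflows.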

Hybrid Robustness Verification for Spatio-Temporal Neural Networks

2606.09746v1 by Sherwin Varghese, Matthew Wicker, Alessio Lomuscio

With AI increasingly deployed in safety-critical systems, providing formal robustness guarantees for the underlying models is essential. Existing verification methods either rely on overly conservative approximations or incur prohibitive computational costs. For example, the use of lp-norm perturbations in video settings encodes the belief that the adversary can inject noise in every video frame. In practice, adversarial perturbations exhibit structured spatial and temporal correlations, constrained to lower-dimensional, semantically meaningful subspaces. In this work, we study robustness verification of 3D CNNs processing video and volumetric inputs, targeting applications in action recognition (UCF-101), autonomous driving (Udacity), and medical imaging (MedMNIST) exploiting realistic assumptions on adversarial strength by modelling them as spatio-temporal constraints - where the attacker can modify either a subset of frames or patches within a set of consecutive frames. We demonstrate that modelling realistic constraints enables tighter approximations. We introduce Spatio-Temporal Bound Propagation (STBP), a verification framework that computes an exact closed-form characterization of the first convolutional layer and propagates certified bounds through subsequent layers using scalable approximations. Computing the exact closed form provides the tightest bounds for the first convolutional layer. Thus, we utilise approximation methods in the remainder of the network. To spur further progress in this field, we propose ST-Bench, a verification benchmark for autonomous driving and activity recognition, to systematically evaluate verifiable robustness. Compared to existing verification-based approaches, STBP provides stronger robustness guarantees with significantly improved scalability, achieving 1.7x higher certified robust accuracy under identical perturbation budgets.

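Abstract: As AI is increasingly deployed in safety-critical systems, providing formal robustness guarantees for the underlying models becomes essential. Existing verification methods either rely on overly conservative approximations or incur prohibitive computational costs. For example, using lp-norm perturbations in video settings encodes the belief that the adversary can inject noise into every video frame. In practice, adversarial perturbations exhibit structured spatial and temporal correlations and are confined to lower-dimensional, semantically meaningful subspaces. In this work, we study robustness verification of 3D CNNs that process video and volumetric inputs, targeting applications in action recognition (UCF-101), autonomous driving (Udacity), and medical imaging (MedMNIST). We exploit realistic assumptions on adversarial strength by modelling them as spatio-temporal constraints, under which the attacker can modify either a subset of frames or patches within a set of consecutive frames. We show that modelling realistic constraints enables tighter approximations. We introduce Spatio-Temporal Bound Propagation (STBP), a verification framework that computes an exact closed-form characterization of the first convolutional layer and propagates certified bounds through subsequent layers using scalable approximations. Computing the exact closed form gives the tightest bounds for the first convolutional layer, so we use approximation methods for the rest of the network. To spur further progress in this field, we propose ST-Bench, a verification benchmark for autonomous driving and activity recognition that systematically evaluates verifiable robustness. Compared with existing verification-based approaches, STBP provides stronger robustness guarantees with significantly improved scalability, achieving 1.7x higher certified robust accuracy under identical perturbation budgets.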
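
The claim that restricting the attacker to a subset of frames yields tighter bounds can be seen on a single linear layer. The sketch below is not STBP: it uses a hypothetical one-output linear map over four "frames" and computes the exact output interval when only the listed inputs may move by at most `eps`.

```python
def linear_bounds(weights, bias, x, eps, perturbed):
    """Exact output interval of y = W x + b when only inputs in `perturbed` move by +/- eps."""
    lower, upper = [], []
    for row, b in zip(weights, bias):
        centre = b + sum(w * xi for w, xi in zip(row, x))
        # Only perturbable inputs contribute to the radius of the interval.
        radius = eps * sum(abs(row[i]) for i in perturbed)
        lower.append(centre - radius)
        upper.append(centre + radius)
    return lower, upper

# Hypothetical layer: one output, four input frames.
weights = [[1.0, -2.0, 3.0, 0.5]]
bias = [0.0]
x = [1.0, 1.0, 1.0, 1.0]

# Attacker may touch every frame (the usual lp-ball assumption).
full_lo, full_hi = linear_bounds(weights, bias, x, eps=0.1, perturbed=[0, 1, 2, 3])
# Attacker may touch only the first frame (a temporal constraint).
sub_lo, sub_hi = linear_bounds(weights, bias, x, eps=0.1, perturbed=[0])
```

With the same `eps`, the constrained interval is strictly contained in the unconstrained one, so any property certified under the all-frames assumption is also certified under the subset assumption, but not the other way round.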

Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery

2606.09672v1 by Suraj Biswas, Saurabh Gupta, Pritam Mukherjee

Ask a pretrained biomedical language model whether "cortisol 28 ug/dL" and "stock-market volatility" are related, and it returns a cosine similarity of 0.83 on a scale where 1.0 means identical. The two share no mechanism. This is not a corner case: every off-the-shelf biomedical encoder we tested (BioBERT, PubMedBERT, BioM-ELECTRA) scores unrelated cross-domain pairs between 0.76 and 0.92 when the answer should be near zero. Accuracy on cross-domain discrimination is 0%. Retrieval systems survive this, because a language model downstream filters the noise. A Large Behavioural Model (LBM), a foundation model whose subject is a person rather than a sentence, does not: it reasons over a graph of a user's life and treats embedding proximity as evidence that two events are causally linked. False proximity writes a false causal edge, and everything downstream inherits the error. Here, embedding geometry is not a tuning knob; it is correctness. We report the fix. A contrastive pass over 72,034 pairs raises PubMedBERT BIOSSES correlation from 0.633 to 0.828 and within-vs-across-domain separation from 1.05x to 1.63x. A second pass, BODHI, mines hard negatives from edges absent in a biomedical knowledge graph and lifts separation to 2.30x and the discrimination gap to +0.392, at a 4.5% BIOSSES cost. On an Intel Xeon 6737P with AMX, OpenVINO cuts single-query latency from 1367 ms to 10 ms (133x) and reaches 555 sentences/sec. One finding contradicts standard advice: FP16 beats INT8 on this silicon at every serving batch size, and we explain why. The same model on a no-AMX Ice Lake instance runs 13-27x slower. We release the benchmark suite, training corpora, the BODHI generator, and the OpenVINO scripts.

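Abstract: Ask a pretrained biomedical language model whether "cortisol 28 ug/dL" and "stock-market volatility" are related, and it returns a cosine similarity of 0.83 on a scale where 1.0 means identical. The two share no mechanism. This is not a corner case: every off-the-shelf biomedical encoder we tested (BioBERT, PubMedBERT, BioM-ELECTRA) scores unrelated cross-domain pairs between 0.76 and 0.92 when the answer should be near zero. Accuracy on cross-domain discrimination is 0%. Retrieval systems survive this because a downstream language model filters the noise. A Large Behavioural Model (LBM), a foundation model whose subject is a person rather than a sentence, does not: it reasons over a graph of a user's life and treats embedding proximity as evidence that two events are causally linked. False proximity writes a false causal edge, and everything downstream inherits the error. Here, embedding geometry is not a tuning knob; it is correctness. We report the fix. A contrastive pass over 72,034 pairs raises PubMedBERT BIOSSES correlation from 0.633 to 0.828 and within-vs-across-domain separation from 1.05x to 1.63x. A second pass, BODHI, mines hard negatives from edges absent in a biomedical knowledge graph and lifts separation to 2.30x and the discrimination gap to +0.392, at a 4.5% BIOSSES cost. On an Intel Xeon 6737P with AMX, OpenVINO cuts single-query latency from 1367 ms to 10 ms (133x) and reaches 555 sentences per second. One finding contradicts standard advice: on this silicon, FP16 beats INT8 at every serving batch size, and we explain why. The same model on an Ice Lake instance without AMX runs 13-27x slower. We release the benchmark suite, the training corpora, the BODHI generator, and the OpenVINO scripts.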
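
A within-vs-across-domain separation figure of the kind quoted above can be sketched as a ratio of mean cosine similarities. The abstract does not define the metric, so the ratio form used here is an assumption, and the two-dimensional vectors and domain tags are hypothetical stand-ins for encoder outputs.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def separation_ratio(embeddings, domains):
    """Mean within-domain similarity divided by mean across-domain similarity."""
    within, across = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine(embeddings[i], embeddings[j])
        (within if domains[i] == domains[j] else across).append(sim)
    return (sum(within) / len(within)) / (sum(across) / len(across))

# Hypothetical embeddings: two biomedical phrases and two finance phrases.
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.2, 1.0], [0.1, 0.9]]
domains = ["bio", "bio", "fin", "fin"]
ratio = separation_ratio(embeddings, domains)
```

A ratio close to 1 means the encoder places unrelated cross-domain items about as close as related ones, which is the failure the abstract describes. Larger values mean the geometry separates the domains.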

Transition-Based Digital Twin Modelling for Alzheimer's Disease under Sparse Longitudinal Data

2606.09671v1 by Yinyu Huang, Yilin Zhang, Sofia Michopoulou, Christopher Kipps, Rahman Attar

Alzheimer's disease (AD) progression is highly heterogeneous and is typically observed through sparse and irregular longitudinal data, posing challenges for prediction and personalised monitoring. Existing machine learning approaches have improved AD prediction using multimodal data, yet often focus on static classification or cohort-level risk estimation, providing limited support for subject-specific modelling and uncertainty-aware reasoning. To address these limitations, we present a personalised digital twin framework for AD prediction and scenario-based analysis using multimodal longitudinal data. The proposed approach integrates complementary modelling strategies to capture clinical transitions and temporal dependencies across visits. Using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), including cognitive assessments, clinical variables, and MRI-derived phenotypes, the framework predicts cognitive status and diagnostic categories while quantifying predictive uncertainty and enabling patient-specific what-if trajectory analysis. Evaluation on leak-free subject-level splits demonstrates strong performance in score forecasting and diagnosis classification. In this sparse and irregular ADNI setting, transition-based modelling of adjacent visits achieved higher predictive accuracy than the sequence-based branch, suggesting that local transition modelling may be more data-efficient. While sequence models remain valuable for uncertainty-aware trajectory forecasting, local transition modelling offers a more data-efficient and robust predictive strategy. These findings highlight the importance of aligning temporal modelling strategies with clinical data structure and suggest that transition-based digital twin formulations may provide a practical and interpretable approach for personalised disease forecasting in neurodegenerative disorders.

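Abstract: Alzheimer's disease (AD) progression is highly heterogeneous and is usually observed through sparse, irregular longitudinal data, which poses challenges for prediction and personalised monitoring. Existing machine learning approaches have improved AD prediction using multimodal data, but they often focus on static classification or cohort-level risk estimation and offer limited support for subject-specific modelling and uncertainty-aware reasoning. To address these limitations, we present a personalised digital twin framework for AD prediction and scenario-based analysis using multimodal longitudinal data. The approach integrates complementary modelling strategies to capture clinical transitions and temporal dependencies across visits. Using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), including cognitive assessments, clinical variables, and MRI-derived phenotypes, the framework predicts cognitive status and diagnostic categories while quantifying predictive uncertainty and enabling patient-specific what-if trajectory analysis. Evaluation on leak-free subject-level splits shows strong performance in score forecasting and diagnosis classification. In this sparse and irregular ADNI setting, transition-based modelling of adjacent visits achieved higher predictive accuracy than the sequence-based branch, suggesting that local transition modelling may be more data-efficient. Although sequence models remain valuable for uncertainty-aware trajectory forecasting, local transition modelling offers a more data-efficient and robust predictive strategy. These findings highlight the importance of aligning temporal modelling strategies with the structure of clinical data and suggest that transition-based digital twin formulations may provide a practical and interpretable approach to personalised disease forecasting in neurodegenerative disorders.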

Self-Explainability in Self-Adaptive and Self-Organising Systems: Status and Research Directions

2606.09568v1 by Tom Beyer, Svea Wisy, Sven Tomforde

The growing complexity of self-adaptive and self-organising systems, fuelled by advances in Artificial Intelligence (AI), has made them increasingly difficult to understand and trust. While Explainable AI aims to provide insight into AI decision-making, a more advanced goal is for systems to explain themselves - an ability referred to as Self-Explainability (SX). This article presents a systematic literature review on SX, analysing existing approaches, including their domains, targets, and evaluation methods. The review develops a unified definition and taxonomy of SX and introduces Levels of Self-Explainability, providing a framework for positioning current and future research. Our results show that most SX approaches remain conceptual, with few practical implementations. Moreover, there is currently no formal or de facto standard for evaluating SX, highlighting a major research gap. This work thus establishes a foundation and roadmap for advancing Self-Explainability in complex systems.

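Abstract: The growing complexity of self-adaptive and self-organising systems, driven by advances in Artificial Intelligence (AI), has made these systems increasingly difficult to understand and trust. While Explainable AI aims to provide insight into AI decision-making, a more advanced goal is for systems to explain themselves, an ability referred to as Self-Explainability (SX). This article presents a systematic literature review on SX, analysing existing approaches, including their domains, targets, and evaluation methods. The review develops a unified definition and taxonomy of SX and introduces Levels of Self-Explainability, providing a framework for positioning current and future research. Our results show that most SX approaches remain conceptual, with few practical implementations. Moreover, there is currently no formal or de facto standard for evaluating SX, which highlights a major research gap. This work therefore lays a foundation and roadmap for advancing Self-Explainability in complex systems.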

Capacity, Not Format: Rethinking Structured Reasoning Failures

2606.09410v1 by Hengxin Fan

Prior work treats structured output as a reasoning tax, but this framing is incomplete: the cost of formatting depends strongly on a model's spare capacity. Using information-matched prose controls and a four-level schema complexity gradient, we separate format-specific effects from prompt-length confounds across 4 models and 5 benchmarks with 0% parse failures on successfully generated responses. We find that structured formats are capacity-dependent. Models with sufficient headroom absorb JSON constraints without degradation (Sonnet: $88.7\pm4.0$% JSON vs. $89.3\pm1.7$% CoT on MATH-Hard). In contrast, formats severely degrade models operating near their limits through two distinct mechanisms. First, under standard token budgets, Haiku drops 36.2pp ($p < 0.0001$) largely due to truncation. Second, even with extended budgets eliminating truncation, GPT-4o-mini drops 28.0pp ($p < 0.001$), revealing pure capacity competition independent of token exhaustion. This format penalty scales with schema complexity (McNemar $p < 0.0001$) and cannot be explained by prompt length alone. Furthermore, these results qualify claims of frontier model immunity: on AIME competition math, Opus 4.7 drops from 96.2% to 91.0% under JSON ($-5.3$pp; the displayed percentages are independently rounded, exact difference is $7/133 = 5.26$pp $\approx 5.3$pp). A delayed-structure ablation -- reasoning freely before formatting -- recovers most of the lost accuracy (3-run mean: 80--87%), supporting the capacity competition mechanism. The practical implication is not to avoid structured output, but to match it to capacity: when a model is near its limits, think first, format later.

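Abstract: Prior work treats structured output as a reasoning tax, but this framing is incomplete: the cost of formatting depends strongly on a model's spare capacity. Using information-matched prose controls and a four-level schema complexity gradient, we separate format-specific effects from prompt-length confounds across 4 models and 5 benchmarks, with 0% parse failures on successfully generated responses. We find that structured formats are capacity-dependent. Models with sufficient headroom absorb JSON constraints without degradation (Sonnet: $88.7\pm4.0$% JSON vs. $89.3\pm1.7$% CoT on MATH-Hard). In contrast, formats severely degrade models operating near their limits, through two distinct mechanisms. First, under standard token budgets, Haiku drops 36.2pp ($p < 0.0001$), largely because of truncation. Second, even with extended budgets that eliminate truncation, GPT-4o-mini still drops 28.0pp ($p < 0.001$), revealing pure capacity competition independent of token exhaustion. This format penalty grows with schema complexity (McNemar $p < 0.0001$) and cannot be explained by prompt length alone. Furthermore, these results qualify claims of frontier model immunity: on AIME competition math, Opus 4.7 drops from 96.2% to 91.0% under JSON ($-5.3$pp; the displayed percentages are independently rounded, and the exact difference is $7/133 = 5.26$pp $\approx 5.3$pp). A delayed-structure ablation, in which the model reasons freely before formatting, recovers most of the lost accuracy (3-run mean: 80--87%), supporting the capacity competition mechanism. The practical implication is not to avoid structured output but to match it to capacity: when a model is near its limits, think first, format later.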
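
The rounding remark in this abstract (a 5.3pp drop between figures displayed as 96.2% and 91.0%) can be checked with a few lines of arithmetic. The denominator 133 and the difference of 7 items come from the abstract. The individual correct counts of 128 and 121 are an inference, not stated in the source: they are the only integers out of 133 that round to the two displayed percentages.

```python
n_items = 133                 # denominator quoted in the abstract
correct_free = 128            # inferred: the only count that rounds to 96.2%
correct_json = 121            # inferred: the only count that rounds to 91.0%

acc_free = round(100 * correct_free / n_items, 1)
acc_json = round(100 * correct_json / n_items, 1)

# Difference of the two rounded figures versus the exact difference.
diff_of_rounded = round(acc_free - acc_json, 1)
exact_diff = 100 * (correct_free - correct_json) / n_items

print(acc_free, acc_json, diff_of_rounded, round(exact_diff, 2))
```

Subtracting the displayed percentages gives 5.2pp, while the exact difference of 7/133 is about 5.26pp and rounds to 5.3pp, which is why the abstract reports the larger figure.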

Disentangling Hallucinations: Orthogonal Semantic Projection for Robust Interpretability

2606.14758v1 by Emirhan Bilgiç, Baptiste Caramiaux, Zhi Yan, Gianni Franchi

As Vision-Language Models are increasingly deployed in safety-critical applications, the trustworthiness of their explanations becomes crucial. Explainable AI (XAI) methods for Vision-Language Models often suffer from semantic hallucination, where attribution maps highlight prominent image regions even when prompted with incorrect text descriptions (e.g., highlighting a dog when prompted ``cat''). Although this problem is widespread, a formal mathematical analysis of XAI methods and CLIP embeddings is largely missing in the literature. We demonstrate that this phenomenon is not specific to a single architecture but is a fundamental consequence of Linear Semantic Leakage in high-dimensional embedding spaces. We propose a unified theoretical framework, Linear Semantic Attribution (LSA), which generalizes across discriminative methods. We introduce OSP, a geometric intervention that utilizes the residual property of OMP to disentangle unique semantic signals from shared concepts. We prove theoretically and demonstrate empirically that OSP minimizes hallucination by orthogonalizing the query vector against distractor concepts, rendering the attribution model blind to shared features while preserving fidelity for correct prompts. Our code is available at: https://github.com/emirhanbilgic/Orthogonal-Semantic-Projection

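Abstract: As Vision-Language Models are increasingly deployed in safety-critical applications, the trustworthiness of their explanations becomes crucial. Explainable AI (XAI) methods for Vision-Language Models often suffer from semantic hallucination, where attribution maps still highlight prominent image regions even when prompted with an incorrect text description (for example, highlighting a dog when prompted with "cat"). Although this problem is widespread, a formal mathematical analysis of XAI methods and CLIP embeddings is largely missing from the literature. We show that this phenomenon is not specific to a single architecture but is a fundamental consequence of Linear Semantic Leakage in high-dimensional embedding spaces. We propose a unified theoretical framework, Linear Semantic Attribution (LSA), which generalizes across discriminative methods. We introduce OSP, a geometric intervention that uses the residual property of OMP to disentangle unique semantic signals from shared concepts. We prove theoretically and demonstrate empirically that OSP minimizes hallucination by orthogonalizing the query vector against distractor concepts, making the attribution model blind to shared features while preserving fidelity for correct prompts. Our code is available at: https://github.com/emirhanbilgic/Orthogonal-Semantic-Projection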
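
The core geometric step described here, orthogonalizing a query vector against distractor concepts, can be sketched with a plain Gram-Schmidt projection. This is not the authors' OSP implementation (which the abstract ties to the residual property of OMP); the three-dimensional vectors are hypothetical stand-ins for text embeddings.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def orthogonalize(query, distractors, eps=1e-12):
    """Return the component of `query` orthogonal to the span of `distractors`."""
    basis = []
    for d in distractors:
        r = list(d)
        # Remove what earlier basis vectors already explain (Gram-Schmidt).
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        norm = math.sqrt(dot(r, r))
        if norm > eps:  # skip distractors that are linearly dependent
            basis.append([ri / norm for ri in r])
    residual = list(query)
    for b in basis:
        c = dot(residual, b)
        residual = [ri - c * bi for ri, bi in zip(residual, b)]
    return residual

# Hypothetical embeddings: the query shares two directions with the distractors.
query = [1.0, 1.0, 1.0]
distractors = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
residual = orthogonalize(query, distractors)
```

The residual has zero inner product with every distractor, so any attribution computed from it can no longer respond to features the query shares with those concepts. What remains is the direction unique to the query.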

TRIAGE: Dialectical Reasoning for Explainable Risk Prediction on Irregularly Sampled Medical Time Series with LLMs

2606.09030v1 by Hyeongwon Jang, Gyouk Chu, Changhun Kim, Joonhyung Park, Hangyul Yoon, Eunho Yang

Clinical early warning systems built on electronic health records, in which clinical observations are recorded as irregularly sampled medical time series (ISMTS), must deliver both calibrated risk scores for patient triage and interpretable rationales that clinicians can verify. Large Language Models (LLMs) have been explored for this task, yet they collapse graded clinical risk into overconfident binary predictions. This risk polarization undermines both calibration and cross-patient comparability. To address this, we propose TRIAGE, a framework that trains an LLM to generate dialectical reasoning over competing clinical outcomes by eliciting outcome-specific rationales. This dialectical formulation mitigates risk polarization, enabling a single LLM to yield continuous risk scores grounded in explicit clinical reasoning. Evaluated on three ISMTS benchmarks, TRIAGE achieves an average AUPRC improvement of 3.3% and reduces calibration error by 81% compared to the competitive baselines. An LLM-as-a-judge assessment further shows that our rationales surpass post-hoc explanations from the baseline by 20% in clinical reasoning quality. The source code is available at https://github.com/HyeongWon-Jang/TRIAGE .

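Summary: Clinical early warning systems built on electronic health records, in which clinical observations are recorded as irregularly sampled medical time series (ISMTS), must provide both calibrated risk scores for patient triage and interpretable rationales that clinicians can verify. Large Language Models (LLMs) have been explored for this task, but they reduce graded clinical risk to overconfident binary predictions. This risk polarization weakens both calibration and cross-patient comparability. To address this, we propose TRIAGE, a framework that trains an LLM to generate dialectical reasoning over competing clinical outcomes by eliciting outcome-specific rationales. This dialectical formulation mitigates risk polarization, allowing a single LLM to produce continuous risk scores grounded in explicit clinical reasoning. Evaluated on three ISMTS benchmarks, TRIAGE improves AUPRC by 3.3% on average and reduces calibration error by 81% compared with competitive baselines. An LLM-as-a-judge assessment further shows that our rationales exceed the baseline's post-hoc explanations by 20% in clinical reasoning quality. The source code is available at https://github.com/HyeongWon-Jang/TRIAGE .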
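
To make the link between risk polarization and calibration concrete, the sketch below computes a binned expected calibration error (ECE) for a graded and a polarized set of risk scores. The abstract does not say which calibration metric is used, so ECE is an assumption here, and the ten-patient cohort and both score vectors are invented purely for illustration.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE for predicted positive-class probabilities."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Assign each prediction to an equal-width bin; 1.0 goes in the last bin.
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # Gap between mean predicted risk and observed event rate.
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap
    return ece

# Hypothetical cohort: 10 patients, 3 of whom experience the event.
labels = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])

graded_scores = np.full(10, 0.30)     # graded risk matching the event rate
polarized_scores = np.full(10, 0.01)  # overconfident "no event" for everyone

ece_graded = expected_calibration_error(graded_scores, labels)
ece_polarized = expected_calibration_error(polarized_scores, labels)

print(ece_graded, ece_polarized)
```

The polarized scores are confidently wrong about a cohort whose true event rate is 30%, so their calibration gap is large, while the graded scores are nearly perfectly calibrated on this toy data.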

Understanding Quantization-Aware Training: Gradients at Quantized Weights Bias to the Low-Loss Basin

2606.09012v1 by Hanyang Li, Jianhao Ma, Ying Cui

Post-training quantization (PTQ) converts a trained full-precision model into low-bit weights without task-level retraining, while quantization-aware training (QAT) incorporates quantization into the training loop. Although PTQ is efficient and often accurate at moderate bitwidths, it can fail sharply at aggressive bitwidths; QAT is more expensive but can often recover the lost accuracy. We propose a unified geometric framework that explains both PTQ failure and QAT recovery. We model full-precision training as following a low-loss "river" inside a wider "valley": a normal neighborhood of the river forms a nearly flat "basin", while leaving this basin incurs a sharp loss increase. When the quantization grid is comparable to the basin width, local PTQ objectives, including rounding and Hessian-based second-order reconstruction, can select a high-loss deployed quantized point outside the basin even when nearby low-loss quantized points exist. In this regime, straight-through-estimator-based QAT has a useful bias: it evaluates gradients at the deployed quantized weights while updating latent full-precision weights, causing the gradient to sense the valley wall and acquire an inward component that steers subsequent quantized iterates back into the basin. We formalize this mechanism through a local landscape model, construct a geometric PTQ failure mode, and prove finite-time QAT recovery under local quantizer-compatibility assumptions. Experiments across vision and language models under multiple neural-network quantization schemes corroborate the predicted basin-crossing failure of PTQ and the corresponding recovery mechanism of QAT.

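Summary: Post-training quantization (PTQ) converts a trained full-precision model into low-bit weights without task-level retraining, whereas quantization-aware training (QAT) brings quantization into the training loop. PTQ is efficient and usually accurate at moderate bitwidths, but it can fail abruptly at aggressive bitwidths; QAT costs more but can often recover the lost accuracy. We propose a unified geometric framework that explains both PTQ failure and QAT recovery. We model full-precision training as following a low-loss "river" inside a wider "valley": a normal neighborhood of the river forms a nearly flat "basin", and leaving this basin causes a sharp increase in loss. When the quantization grid is comparable to the basin width, local PTQ objectives, including rounding and Hessian-based second-order reconstruction, may select a high-loss quantized point even when low-loss quantized points exist nearby. In this regime, straight-through-estimator-based QAT has a useful bias: it evaluates gradients at the deployed quantized weights while updating the latent full-precision weights, so the gradient senses the valley wall and gains an inward component that steers subsequent quantized iterates back into the basin. We formalize this mechanism with a local landscape model, construct a geometric PTQ failure mode, and prove finite-time QAT recovery under local quantizer-compatibility assumptions. Experiments on vision and language models under multiple neural-network quantization schemes confirm the predicted basin-crossing failure of PTQ and the corresponding recovery mechanism of QAT.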
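
The straight-through-estimator mechanism described above (gradient evaluated at the quantized weight, update applied to the latent weight) can be illustrated with a one-dimensional caricature. This is only a sketch under stated assumptions, not the paper's river-valley model: the asymmetric quadratic loss, the minimizer `W_STAR`, the grid spacing, the learning rate, and the step count are all hypothetical values chosen so that nearest rounding lands on the steep "wall" side. The run is kept short on purpose; over longer runs STE iterates can oscillate between grid points.

```python
import numpy as np

W_STAR = 0.6   # hypothetical full-precision minimizer
STEP = 1.0     # hypothetical quantization grid spacing

def loss(w):
    # Steep wall to the right of W_STAR, nearly flat basin to the left.
    d = w - W_STAR
    return 10.0 * d**2 if d > 0 else 0.1 * d**2

def grad(w):
    d = w - W_STAR
    return 20.0 * d if d > 0 else 0.2 * d

def quantize(w):
    # Round-to-nearest onto the uniform grid {..., -1, 0, 1, ...} * STEP.
    return STEP * np.round(w / STEP)

def ste_qat(w_latent, lr=0.05, steps=20):
    for _ in range(steps):
        q = quantize(w_latent)
        # STE update: gradient taken at the quantized point q,
        # but applied to the latent full-precision weight.
        w_latent = w_latent - lr * grad(q)
    return w_latent

ptq_point = quantize(W_STAR)            # rounding the trained weight directly
qat_point = quantize(ste_qat(W_STAR))   # deployed weight after STE updates

print(ptq_point, loss(ptq_point))
print(qat_point, loss(qat_point))
```

In this toy run the first gradient is evaluated on the wall at the rounded point, which is large enough to push the latent weight across the rounding boundary, so the deployed weight moves to the neighboring grid point on the flat side.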

Medical explainable AI