Medical explainable AI

Publish Date Title Authors Homepage Code
2026-04-24 Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings Inês Oliveira e Silva et.al. 2604.22662v1 null
2026-04-24 Hidden Failure Modes of Gradient Modification under Adam in Continual Learning, and Adaptive Decoupled Moment Routing as a Repair Yuelin Hu et.al. 2604.22407v1 null
2026-04-24 Tell Me Why: Designing an Explainable LLM-based Dialogue System for Student Problem Behavior Diagnosis Zhilin Fan et.al. 2604.22237v1 null
2026-04-24 Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems Meghana Karnam et.al. 2604.22154v1 null
2026-04-23 Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations Nalin Poungpeth et.al. 2604.22109v1 null
2026-04-23 Optimal Question Selection from a Large Question Bank for Clinical Field Recovery in Conversational Psychiatric Intake Guan Gui et.al. 2604.22067v1 null
2026-04-23 Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores Shevya Pandya et.al. 2604.22063v1 null
2026-04-23 H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers Ayushi Mehrotra et.al. 2604.22045v1 null
2026-04-23 EgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms Brian VanVoorst et.al. 2604.22036v1 null
2026-04-23 Shared Lexical Task Representations Explain Behavioral Variability In LLMs Zhuonan Yang et.al. 2604.22027v1 null
2026-04-23 Fine-Grained Perspectives: Modeling Explanations with Annotator-Specific Rationales Olufunke O. Sarumi et.al. 2604.21667v1 null
2026-04-23 Task-specific Subnetwork Discovery in Reinforcement Learning for Autonomous Underwater Navigation Yi-Ling Liu et.al. 2604.21640v1 null
2026-04-23 On the Role of Preprocessing and Memristor Dynamics in Reservoir Computing for Image Classification Rishona Daniels et.al. 2604.21602v1 null
2026-04-23 Dynamical Priors as a Training Objective in Reinforcement Learning Sukesh Subaharan et.al. 2604.21464v1 null
2026-04-23 Trustworthy Clinical Decision Support Using Meta-Predicates and Domain-Specific Languages Michael Bouzinier et.al. 2604.21263v1 null
2026-04-22 Agentic AI for Personalized Physiotherapy: A Multi-Agent Framework for Generative Video Training and Real-Time Pose Correction Abhishek Dharmaratnakar et.al. 2604.21154v1 null
2026-04-22 Propensity Inference: Environmental Contributors to LLM Behaviour Olli Järviniemi et.al. 2604.21098v1 null
2026-04-22 SGD at the Edge of Stability: The Stochastic Sharpness Gap Fangshuo Liao et.al. 2604.21016v1 null
2026-04-22 Convergent Evolution: How Different Language Models Learn Similar Number Representations Deqing Fu et.al. 2604.20817v1 null
2026-04-22 Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems Pavel Salovskii et.al. 2604.20795v1 null
2026-04-22 Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs Mariano Barone et.al. 2604.20791v1 null
2026-04-22 Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation Andrew Klearman et.al. 2604.20763v1 null
2026-04-22 Participatory provenance as representational auditing for AI-mediated public consultation Sachit Mahajan et.al. 2604.20711v1 null
2026-04-22 RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking Roie Kazoom et.al. 2604.20623v1 null
2026-04-22 Evian: Towards Explainable Visual Instruction-tuning Data Auditing Zimu Jia et.al. 2604.20544v1 null
2026-04-22 MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills Yingyong Hou et.al. 2604.20441v1 null
2026-04-22 Surrogate modeling for interpreting black-box LLMs in medical predictions Changho Han et.al. 2604.20331v2 null
2026-04-22 Stateless Decision Memory for Enterprise AI Agents Vasundra Srinivasan et.al. 2604.20158v1 null
2026-04-21 From Fuzzy to Formal: Scaling Hospital Quality Improvement with AI Patrick Vossler et.al. 2604.20055v1 null
2026-04-21 TriEx: A Game-based Tri-View Framework for Explaining Internal Reasoning in Multi-Agent LLMs Ziyi Wang et.al. 2604.20043v1 null
2026-04-21 Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models Kihyuk Lee et.al. 2604.19598v2 null
2026-04-21 Integrating Anomaly Detection into Agentic AI for Proactive Risk Management in Human Activity Farbod Zorriassatine et.al. 2604.19538v1 null
2026-04-21 EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training Chengjun Pan et.al. 2604.19485v1 null
2026-04-21 Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents Vasundra Srininvasan et.al. 2604.19457v1 null
2026-04-21 TACENR: Task-Agnostic Contrastive Explanations for Node Representations Vasiliki Papanikou et.al. 2604.19372v1 null
2026-04-21 Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications Abu Noman Md Sakib et.al. 2604.19281v1 null
2026-04-20 Gradient-Based Program Synthesis with Neurally Interpreted Languages Matthew V. Macfarlane et.al. 2604.18907v1 null
2026-04-20 AI scientists produce results without reasoning scientifically Martiño Ríos-García et.al. 2604.18805v1 null
2026-04-20 Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling Andrew Wang et.al. 2604.18753v1 null
2026-04-20 On the Importance and Evaluation of Narrativity in Natural Language AI Explanations Mateusz Cedro et.al. 2604.18311v1 null
2026-04-20 Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support Eranga Bandara et.al. 2604.18302v1 null
2026-04-20 Rabies diagnosis in low-data settings: A comparative study on the impact of data augmentation and transfer learning Khalil Akremi et.al. 2604.19823v1 null
2026-04-20 ExAI5G: A Logic-Based Explainable AI Framework for Intrusion Detection in 5G Networks Saeid Sheikhi et.al. 2604.18052v1 null
2026-04-20 First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows Sihao Xing et.al. 2604.18038v1 null
2026-04-20 How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers Xiao Wang et.al. 2604.17935v1 null
2026-04-20 AI Approach for MRI-only Full-Spine Vertebral Segmentation and 3D Reconstruction in Paediatric Scoliosis Nathasha Naranpanawa et.al. 2604.17846v1 null
2026-04-20 Community-Led AI Integration for Wildfire Risk Assessment: A Participatory AI Literacy and Explainability Integration (PALEI) Framework in Los Angeles, CA Sanaz Sadat Hosseini et.al. 2604.17755v1 null
2026-04-20 MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models Suhyun Lee et.al. 2604.17730v1 null
2026-04-20 Semantic Entanglement in Vector-Based Retrieval: A Formal Framework and Context-Conditioned Disentanglement Pipeline for Agentic RAG Systems Nick Loghmani et.al. 2604.17677v1 null
2026-04-19 On The Mathematics of the Natural Physics of Optimization I. M. Ross et.al. 2604.17645v1 null
2026-04-19 STEP-PD: Stage-Aware and Explainable Parkinson's Disease Severity Classification Using Multimodal Clinical Assessments Md Mezbahul Islam et.al. 2604.17611v1 null
2026-04-19 CDSA-Net:Collaborative Decoupling of Vascular Structure and Background for High-Fidelity Coronary Digital Subtraction Angiography Si Li et.al. 2604.17208v1 null
2026-04-19 Persona-Based Requirements Engineering for Explainable Multi-Agent Educational Systems: A Scenario Simulator for Clinical Reasoning Training Weibing Zheng et.al. 2604.17186v1 null
2026-04-18 Abstain-R1: Calibrated Abstention and Post-Refusal Clarification via Verifiable RL Skylar Zhai et.al. 2604.17073v1 null
2026-04-18 Hybrid Quantum Neural Networks for Enhanced Breast Cancer Thermographic Classification: A Novel Quantum-Classical Integration Approach Riza Alaudin Syah et.al. 2604.16953v1 null
2026-04-18 LLMs can persuade only psychologically susceptible humans on societal issues, via trust in AI and emotional appeals, amid logical fallacies Alexis Carrillo et.al. 2604.16935v1 null
2026-04-18 The Reliance Negotiation Framework: A Dynamic Process Model of Student LLM Engagement in Academic Writing Shahin Hossain et.al. 2604.16772v1 null
2026-04-17 Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals Yang Shanglin et.al. 2604.16745v1 null
2026-04-17 CT Open: An Open-Access, Uncontaminated, Live Platform for the Open Challenge of Clinical Trial Outcome Prediction Jianyou Wang et.al. 2604.16742v1 null
2026-04-17 When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis Justice Owusu Agyemang et.al. 2604.16736v1 null
2026-04-17 Agentic Large Language Models for Training-Free Neuro-Radiological Image Analysis Ayhan Can Erdur et.al. 2604.16729v1 null
2026-04-17 The Query Channel: Information-Theoretic Limits of Masking-Based Explanations Erciyes Karakaya et.al. 2604.16689v1 null
2026-04-17 Using Large Language Models and Knowledge Graphs to Improve the Interpretability of Machine Learning Models in Manufacturing Thomas Bayer et.al. 2604.16280v1 null
2026-04-17 MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation Yi Lin et.al. 2604.16175v1 null
2026-04-17 Can LLMs Understand the Impact of Trauma? Costs and Benefits of LLMs Coding the Interviews of Firearm Violence Survivors Jessica H. Zhu et.al. 2604.16132v1 null
2026-04-17 Dual-Modal Lung Cancer AI: Interpretable Radiology and Microscopy with Clinical Risk Integration Baramee Sukumal et.al. 2604.16104v1 null
2026-04-17 Towards Intrinsic Interpretability of Large Language Models:A Survey of Design Principles and Architectures Yutong Gao et.al. 2604.16042v2 null
2026-04-17 Evaluating Temporal and Structural Anomaly Detection Paradigms for DDoS Traffic Yasmin Souza Lima et.al. 2604.16575v1 null
2026-04-17 Towards Rigorous Explainability by Feature Attribution Olivier Létoffé et.al. 2604.15898v1 null
2026-04-17 Closing the Theory-Practice Gap in Spiking Transformers via Effective Dimension Dongxin Guo et.al. 2604.15769v1 null
2026-04-17 LLM Reasoning Is Latent, Not the Chain of Thought Wenshuo Wang et.al. 2604.15726v1 null
2026-04-16 LLM attribution analysis across different fine-tuning strategies and model scales for automated code compliance Jack Wei Lun Shi et.al. 2604.15589v1 null
2026-04-16 Towards Reliable Testing of Machine Unlearning Anna Mazhar et.al. 2604.16536v1 null
2026-04-16 Beyond Attack Success Rate: A Multi-Metric Evaluation of Adversarial Transferability in Medical Imaging Models Emily Curl et.al. 2604.16532v1 null
2026-04-16 DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI Zhizheng Wang et.al. 2604.15456v1 null
2026-04-16 RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography Mélanie Roschewitz et.al. 2604.15231v1 null
2026-04-16 Expert-Annotated Embryo Image Dataset with Natural Language Descriptions for Evidence-Based Patient Communication in IVF Nicklas Neu et.al. 2604.16528v1 null
2026-04-16 Agentic Explainability at Scale: Between Corporate Fears and XAI Needs Yomna Elsayed et.al. 2604.14984v1 null
2026-04-16 Hybrid Decision Making via Conformal VLM-generated Guidance Debodeep Banerjee et.al. 2604.14980v2 null
2026-04-16 Can LLMs Score Medical Diagnoses and Clinical Reasoning as well as Expert Panels? Amy Rouillard et.al. 2604.14892v2 null
2026-04-16 M2-PALE: A Framework for Explaining Multi-Agent MCTS--Minimax Hybrids via Process Mining and LLMs Yiyu Qian et.al. 2604.14687v1 null
2026-04-16 Analyzing Chain of Thought (CoT) Approaches in Control Flow Code Deobfuscation Tasks Seyedreza Mohseni et.al. 2604.15390v2 null
2026-04-16 Rethinking Patient Education as Multi-turn Multi-modal Interaction Zonghai Yao et.al. 2604.14656v1 null
2026-04-16 CoDaS: AI Co-Data-Scientist for Biomarker Discovery via Wearable Sensors Yubin Kim et.al. 2604.14615v1 null
2026-04-16 Generative Augmented Inference Cheng Lu et.al. 2604.14575v1 null
2026-04-16 Perspective on Bias in Biomedical AI: Preventing Downstream Healthcare Disparities Michal Rosen-Zvi et.al. 2604.14514v1 null
2026-04-15 When PCOS Meets Eating Disorders: An Explainable AI Approach to Detecting the Hidden Triple Burden Apoorv Prasad et.al. 2604.14356v1 null
2026-04-15 Faithfulness Serum: Mitigating the Faithfulness Gap in Textual Explanations of LLM Decisions via Attribution Guidance Bar Alon et.al. 2604.14325v1 null
2026-04-15 Seeing Through Experts Eyes A Foundational Vision Language Model Trained on Radiologists Gaze and Reasoning Kinhei Lee et.al. 2604.14316v1 null
2026-04-15 EuropeMedQA Study Protocol: A Multilingual, Multimodal Medical Examination Dataset for Language Model Evaluation Francesco Andrea Causio et.al. 2604.14306v2 null
2026-04-15 Quantum-inspired tensor networks in machine learning models Guillermo Valverde et.al. 2604.14287v1 null
2026-04-15 Applied Explainability for Large Language Models: A Comparative Study Venkata Abhinandan Kancharla et.al. 2604.15371v1 null
2026-04-15 Med-CAM: Minimal Evidence for Explaining Medical Decision Making Pirzada Suhail et.al. 2604.13695v1 null
2026-04-15 Learning from Change: Predictive Models for Incident Prevention in a Regulated IT Environment Eileen Kapel et.al. 2604.13462v1 null
2026-04-15 Interpretable and Explainable Surrogate Modeling for Simulations: A State-of-the-Art Survey and Perspectives on Explainable AI for Decision-Making Pramudita Satria Palar et.al. 2604.14240v1 null
2026-04-15 ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold Chenlang Yi et.al. 2604.13392v1 null
2026-04-15 Young people's perceptions and recommendations for conversational generative artificial intelligence in youth mental health Adam Poulsen et.al. 2604.13381v1 null
2026-04-14 Explainable Fall Detection for Elderly Care via Temporally Stable SHAP in Skeleton-Based Human Activity Recognition Mohammad Saleh et.al. 2604.13279v1 null
2026-04-14 Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs Vishal Pramanik et.al. 2604.13258v1 null
2026-04-14 Explainable Graph Neural Networks for Interbank Contagion Surveillance: A Regulatory-Aligned Framework for the U.S. Banking Sector Mohammad Nasir Uddin et.al. 2604.14232v1 null

Abstracts

Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings

2604.22662v1 by Inês Oliveira e Silva, Sérgio Jesus, Iker Perez, Rita P. Ribeiro, Carlos Soares, Hugo Ferreira, Pedro Bizarro

Shapley values are a cornerstone of explainable AI, yet their proliferation into competing formulations has created a fragmented landscape with little consensus on practical deployment. While theoretical differences are well-documented, evaluation remains reliant on quantitative proxies whose alignment with human utility is unverified. In this work, we use a unified amortized framework to isolate semantic differences between eight Shapley variants under the low-latency constraints of operational risk workflows. We conduct a large-scale empirical evaluation across four risk datasets and a realistic fraud-detection environment involving professional analysts and 3,735 case reviews. Our results reveal a fundamental misalignment: standard quantitative metrics, such as sparsity and faithfulness, are decoupled from human-perceived clarity and decision utility. Furthermore, while no formulation improved objective analyst performance, explanations consistently increased decision confidence, signaling a critical risk of automation bias in high-stakes settings. These findings suggest that current evaluation proxies are insufficient for predicting downstream human impact, and we provide evidence-based guidance for selecting formulations and metrics in operational decision systems.
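
For reference, the quantity that all of these formulations build on can be computed exactly for a small game. The sketch below is a minimal illustration of the classical Shapley value, not any of the eight variants or the amortized framework from the paper; the feature names and the toy value function are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition (exponential in len(players))."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_value(coalition):
    """Hypothetical risk score: two additive effects plus one pairwise interaction."""
    score = 0.0
    if "amount" in coalition:
        score += 2.0
    if "country" in coalition:
        score += 1.0
    if "amount" in coalition and "country" in coalition:
        score += 1.0  # interaction term, split evenly between the two features
    return score


features = ["amount", "country", "hour"]
phi = shapley_values(features, toy_value)
print(phi)
```

The attributions sum to the value of the full coalition minus the value of the empty one (the efficiency property), and the feature that never changes the score receives zero.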

Hidden Failure Modes of Gradient Modification under Adam in Continual Learning, and Adaptive Decoupled Moment Routing as a Repair

2604.22407v1 by Yuelin Hu, Zhenbo Yu, Zhengxue Cheng, Wei Liu, Li Song

Many continual-learning methods modify gradients upstream (e.g., projection, penalty rescaling, replay mixing) while treating Adam as a neutral backend. We show this composition has a hidden failure mode. In a high-overlap, non-adaptive 8-domain continual LM, all shared-routing projection baselines collapse close to vanilla forgetting (12.5--12.8 vs. 13.2). A 0.5% replay buffer is the strongest shared alternative but still reaches 11.6, while fixed-strength decoupling falls below vanilla at 14.1. Only adaptive decoupled routing remains stable at 9.4, improving over vanilla by 3.8 units. On a 16-domain stream, its gain over the strongest shared-routing projection baseline grows to 4.5--4.8 units. The failure is largely invisible on clean benchmarks. We explain this effect through Adam's second-moment pathway: in the tested regime, projection induces a 1/(1-alpha) inflation of the old-direction effective learning rate, matching measurements within 8% across eight alpha values. The same conflict appears with penalty methods, replay mixing, and at 7B scale under LoRA. Our fix routes the modified gradient only to the first moment while preserving magnitude-faithful second-moment statistics, with overlap-aware adaptive strength. This simple change is the only tested configuration that consistently avoids collapse across methods, optimizers, and scale.
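
One way to read the 1/(1-alpha) claim is through Adam's scale invariance: if an upstream method shrinks the old-direction gradient by a factor (1-alpha) and that shrunken gradient also feeds the second moment, the normalization cancels the shrinkage. The toy simulation below is a sketch under that assumption, using a constant scalar gradient. It is not the authors' derivation or code, and `alpha` as a uniform shrink factor is an assumed stand-in for the projection strength.

```python
import math


def steady_state_step(alpha, decoupled, steps=500, beta1=0.9, beta2=0.999):
    """Adam update size (in units of the learning rate) along one old-task direction."""
    raw_grad = 1.0                            # toy constant gradient
    modified_grad = (1.0 - alpha) * raw_grad  # assumed effect of the upstream modification
    m = v = 0.0
    for _ in range(steps):
        # The first moment always sees the modified gradient.
        m = beta1 * m + (1 - beta1) * modified_grad
        # Shared routing: the second moment also sees the modified gradient.
        # Decoupled routing: the second moment keeps the raw gradient magnitude.
        second_moment_input = raw_grad if decoupled else modified_grad
        v = beta2 * v + (1 - beta2) * second_moment_input ** 2
    m_hat = m / (1 - beta1 ** steps)  # bias correction
    v_hat = v / (1 - beta2 ** steps)
    return m_hat / math.sqrt(v_hat)


alpha = 0.5  # must be < 1, otherwise the shared second moment is zero
shared = steady_state_step(alpha, decoupled=False)
decoupled = steady_state_step(alpha, decoupled=True)
inflation = shared / decoupled
print(shared, decoupled, inflation)
```

In this toy setting the shared route takes a full-size step despite the shrunken gradient, while the decoupled route takes a step scaled by (1-alpha), so their ratio equals 1/(1-alpha).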

Tell Me Why: Designing an Explainable LLM-based Dialogue System for Student Problem Behavior Diagnosis

2604.22237v1 by Zhilin Fan, Deliang Wang, Penghe Chen, Yu Lu

Diagnosing student problem behaviors requires teachers to synthesize multifaceted information, identify behavioral categories, and plan intervention strategies. Although fine-tuned large language models (LLMs) can support this process through multi-turn dialogue, they rarely explain why a strategy is recommended, limiting transparency and teachers' trust. To address this issue, we present an explainable dialogue system built on a fine-tuned LLM. The system uses a hierarchical attribution method based on explainable AI (xAI) to identify dialogue evidence for each recommendation and generate a natural-language explanation based on that evidence. In technical evaluation, the method outperformed baseline approaches in identifying supporting evidence. In a preliminary user study with 22 pre-service teachers, participants who received explanations reported higher trust in the system. These findings suggest a promising direction for improving LLM explainability in educational dialogue systems.

Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems

2604.22154v1 by Meghana Karnam, Ananya Joshi

Emerging AI systems in behavioral health and psychiatry use multi-step or multi-agent LLM pipelines for tasks like assessing self-harm risk and screening for depression. However, common evaluation approaches, like LLM-as-a-judge, do not indicate when a decision is reliable or how errors may accumulate across multiple LLM judgements, limiting their suitability for safety-critical settings. We present a statistical framework for multi-agent pipelines structured as directed acyclic graphs (DAGs) that replaces heuristic voting with principled, adaptive decision-making. We model each agent as a stochastic categorical decision and introduce (1) tighter agent-level performance confidence bounds, (2) a bandit-based adaptive sampling strategy based on input difficulty, and (3) regret guarantees over the multi-agent system that show logarithmic error growth when deployed. We evaluate our system on two labeled datasets in behavioral health: the AEGIS 2.0 behavioral health subset (N=161) and a stratified sample of SWMH Reddit posts (N=250). Empirically, our adaptive sampling strategy achieves the lowest false positive rate of any condition across both datasets, 0.095 on AEGIS 2.0 compared to 0.159 for single-agent models, reducing incorrect flagging of safe content by 40% while keeping false negative rates similar across all conditions. These results suggest that principled adaptive sampling offers a meaningful improvement in precision without reducing recall in this setting.
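
The reported 40% reduction can be checked directly from the two false positive rates quoted for AEGIS 2.0. The snippet below is a small arithmetic check only; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(baseline, improved):
    """Fractional drop from baseline to improved, e.g. 0.25 means 25% lower."""
    return (baseline - improved) / baseline


single_agent_fpr = 0.159  # false positive rate reported for single-agent models
adaptive_fpr = 0.095      # false positive rate reported for adaptive sampling

reduction = relative_reduction(single_agent_fpr, adaptive_fpr)
print(f"{reduction:.1%}")
```

The result is just over 40%, consistent with the figure in the abstract.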

Spontaneous Persuasion: An Audit of Model Persuasiveness in Everyday Conversations

2604.22109v1 by Nalin Poungpeth, Nicholas Clark, Tanu Mitra

Large language models (LLMs) possess strong persuasive capabilities that outperform humans in head-to-head comparisons. Users report consulting LLMs to inform major life decisions in relationships, medical settings, and when seeking professional advice. Prior work measures persuasion as intentional attempts at producing the most effective argument or convincing statement. This fails to capture everyday human-AI interactions in which users seek information or advice. To address this gap, we introduce "spontaneous persuasion," which characterizes the inexplicit use of persuasive strategies in everyday scenarios where persuasion is not necessarily warranted. We conduct an audit of five LLMs to uncover how frequently and through which techniques spontaneous persuasion appears in multi-turn conversations. To simulate response styles, we provide a user response taxonomy grounded in literature from psychology, communication, and linguistics. Furthermore, we compare the distribution of spontaneous persuasion produced by LLMs with human responses on the same topics, collected from Reddit. We find LLMs spontaneously persuade the user in virtually all conversations, heavily relying on information-based strategies such as appeals to logic or quantitative evidence. This was consistent across models and user response styles, but conversations concerning mental health saw higher rates of appraisal-based and emotion-based strategies. In comparison, human responses tended to invoke strategies that generate social influence, like negative emotion appeals and non-expert testimony. This difference may explain the effectiveness of LLM in persuading users, as well as the perception of models as objective and impartial.

Optimal Question Selection from a Large Question Bank for Clinical Field Recovery in Conversational Psychiatric Intake

2604.22067v1 by Guan Gui, Peter Zandi, Jacob Taylor, Ananya Joshi

Psychiatric intake is a sequential, high-stakes information-gathering process in which clinicians must decide what to ask, in what order, and how to interpret incomplete or ambiguous responses under limited time. Despite growing interest in conversational AI for healthcare, there is still limited infrastructure for conversational AI in this application. Accordingly, we formulate this task as a question-selection problem with clinically grounded questions, known target information, and controllable patient difficulty. We also introduce a task-specific question-selection benchmark based on a bank of 655 clinician-authored intake questions and corresponding synthetic patient vignettes with 5 different behavioral conditions. In our evaluation, we compare random questioning, a clinical psychiatric intake form baseline, and an LLM-guided adaptive policy across 300 interview sessions spanning four patients and five behavioral conditions. Across the benchmark, the clinically ordered fixed form substantially outperforms random questioning, and the LLM-guided policy achieves the strongest overall recovery. The advantage of adaptation grows sharply under patient behavior that is less amenable to field recovery, especially under guarded-concise conditions. These findings suggest that performance in conversational clinical systems depends not only on language understanding after information is disclosed, but also on whether the system reaches the right topics within a limited interaction budget. More broadly, the benchmark provides a controlled framework for studying how clinical structure and adaptive follow-up contribute to information recovery in interactive clinical machine learning.

Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores

2604.22063v1 by Shevya Pandya, Shinjini Bose, Ananya Joshi

Large language models (LLMs) are increasingly utilized in clinical reasoning and risk assessment. However, their interpretive reliability in critical and indeterminate domains such as psychiatry remains unclear. Prior work has identified algorithmic biases and prompt sensitivity in these systems, raising concerns about how contextual information may influence model outputs, but there remains no systematic way to assess these, especially in the psychiatric domain. We propose an approach for auditing the reliability of downstream LLM tasks that structures evaluation around the impact of prompt design and the inclusion of medically insignificant inputs on predicted hospitalization risk scores, which are often the first downstream AI clinical-decision-making task. In our audit, a cohort of synthetic patient profiles (n = 50) is generated, each consisting of 15 clinically relevant features and up to 50 clinically insignificant features, across four prompt reframings (neutral, logical, human impact, clinical judgment). We audit four LLMs (Gemini 2.5 Flash, LLaMa 3.3 70b, Claude Sonnet 4.6, GPT-4o mini), and our results show that including medically insignificant variables resulted in a statistically significant increase in the absolute mean predicted hospitalization risk and output variability across all models and prompts, indicating reduced predictive stability as contextual noise increased. Clinically insignificant features had an effect on instability across many model-prompt conditions, and prompt variations independently affected the trajectory of instability in a model-dependent manner. These findings quantify how sensitive LLM-based psychiatric risk assessments are to non-clinical information, highlighting the need for systematic evaluations of attributional stability and uncertainty behavior, such as this audit, before clinical deployment.

H-Sets: Hessian-Guided Discovery of Set-Level Feature Interactions in Image Classifiers

2604.22045v1 by Ayushi Mehrotra, Dipkamal Bhusal, Michael Clifford, Nidhi Rastogi

Feature attribution methods explain the predictions of deep neural networks by assigning importance scores to individual input features. However, most existing methods focus solely on marginal effects, overlooking feature interactions, where groups of features jointly influence model output. Such interactions are especially important in image classification tasks, where semantic meaning often arises from pixel interdependencies rather than isolated features. Existing interaction-based methods for images are either coarse (e.g., superpixel-only) or fail to satisfy core interpretability axioms. In this work, we introduce H-Sets, a novel two-stage framework for discovering and attributing higher-order feature interactions in image classifiers. First, we detect locally interacting pairs via input Hessians and recursively merge them into semantically coherent sets; segmentation from Segment Anything (SAM) is used as a spatial grouping prior but can be replaced by other segmentations. Second, we attribute each set with IDG-Vis, a set-level extension of Integrated Directional Gradients that integrates directional gradients along pixel-space paths and aggregates them with Harsanyi dividends. While Hessians introduce additional compute at the detection stage, this targeted cost consistently yields saliency maps that are sparser and more faithful. Evaluations across VGG, ResNet, DenseNet and MobileNet models on ImageNet and CUB datasets show that H-Sets generates more interpretable and faithful saliency maps than existing methods.
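
The idea behind detecting interacting pairs with an input Hessian is that a nonzero mixed partial derivative means two inputs jointly affect the output. The sketch below illustrates only that idea, using central finite differences on a toy three-input function. It is not the H-Sets pipeline, which operates on image classifiers; the function names, the toy model, and the threshold are all hypothetical.

```python
from itertools import combinations


def mixed_partial(f, x, i, j, h=1e-3):
    """Central finite-difference estimate of d^2 f / (dx_i dx_j) for i != j."""
    def shifted(di, dj):
        y = list(x)
        y[i] += di
        y[j] += dj
        return f(y)
    return (shifted(h, h) - shifted(h, -h) - shifted(-h, h) + shifted(-h, -h)) / (4 * h * h)


def interacting_pairs(f, x, threshold=1e-3):
    """Return index pairs whose mixed partial at x exceeds the threshold in magnitude."""
    pairs = []
    for i, j in combinations(range(len(x)), 2):
        if abs(mixed_partial(f, x, i, j)) > threshold:
            pairs.append((i, j))
    return pairs


def toy_model(x):
    """Hypothetical scalar model: inputs 0 and 1 interact, input 2 acts alone."""
    return x[0] * x[1] + 3.0 * x[2]


point = [0.5, -1.0, 2.0]
pairs = interacting_pairs(toy_model, point)
print(pairs)
```

Only the pair (0, 1) is flagged, since input 2 contributes additively and has zero mixed partials with the others.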

EgoMAGIC- An Egocentric Video Field Medicine Dataset for Training Perception Algorithms

2604.22036v1 by Brian VanVoorst, Nicholas Walczak, Christopher Gilleo, Charles Meissner, Fabio Felix, Iran Roman, Bea Steers, Claudio Silva, Yuhan Shen, Zijia Lu, Shih-Po Lee, Ehsan Elhamifar

This paper introduces EgoMAGIC (Medical Assistance, Guidance, Instruction, and Correction), an egocentric medical activity dataset collected as part of DARPA's Perceptually-enabled Task Guidance (PTG) program. The dataset comprises 3,355 videos of 50 medical tasks, with at least 50 labeled videos per task. The primary objective of the PTG program was to develop virtual assistants integrated into augmented reality headsets to assist users in performing complex tasks. To encourage exploration and research using this dataset, the medical training data has been released along with an action detection challenge focused on eight medical tasks. The majority of the videos were recorded using a head-mounted stereo camera with integrated audio. From this dataset, 40 YOLO models were trained using 1.95 million labels to detect 124 medical objects, providing a robust starting point for developers working on medical AI applications. In addition to introducing the dataset, this paper presents baseline results on action detection for the eight selected medical tasks across three models, with the best-performing method achieving an average mAP of 0.526. Although this paper primarily addresses action detection as the benchmark, the EgoMAGIC dataset is equally suitable for action recognition, object identification and detection, error detection, and other challenging computer vision tasks. The dataset is accessible via zenodo.org (DOI: 10.5281/zenodo.19239154).

Shared Lexical Task Representations Explain Behavioral Variability In LLMs

2604.22027v1 by Zhuonan Yang, Jacob Xiaochen Li, Francisco Piedrahita Velez, Eric Todd, David Bau, Michael L. Littman, Stephen H. Bach, Ellie Pavlick

One of the most common complaints about large language models (LLMs) is their prompt sensitivity -- that is, the fact that their ability to perform a task or provide a correct answer to a question can depend unpredictably on the way the question is posed. We investigate this variation by comparing two very different but commonly-used styles of prompting: instruction-based prompts, which describe the task in natural language, and example-based prompts, which provide in-context few-shot demonstration pairs to illustrate the task. We find that, despite large variation in performance as a function of the prompt, the model engages some common underlying mechanisms across different prompts of a task. Specifically, we identify task-specific attention heads whose outputs literally describe the task -- which we dub lexical task heads -- and show that these heads are shared across prompting styles and trigger subsequent answer production. We further find that behavioral variation between prompts can be explained by the degree to which these heads are activated, and that failures are at least sometimes due to competing task representations that dilute the signal of the target task. Our results together present an increasingly clear picture of how LLMs' internal representations can explain behavior that otherwise seems idiosyncratic to users and developers.

Fine-Grained Perspectives: Modeling Explanations with Annotator-Specific Rationales

2604.21667v1 by Olufunke O. Sarumi, Charles Welch, Daniel Braun

Beyond exploring disaggregated labels for modeling perspectives, annotator rationales provide fine-grained signals of individual perspectives. In this work, we propose a framework for jointly modeling annotator-specific label prediction and corresponding explanations, fine-tuned on the annotators' provided rationales. Using a dataset with disaggregated natural language inference (NLI) annotations and annotator-provided explanations, we condition predictions on both annotator identity and demographic metadata through a representation-level User Passport mechanism. We further introduce two explainer architectures: a post-hoc prompt-based explainer and a prefixed bridge explainer that transfers annotator-conditioned classifier representations directly into a generative model. This design enables explanation generation aligned with individual annotator perspectives. Our results show that incorporating explanation modeling substantially improves predictive performance over a baseline annotator-aware classifier, with the prefixed bridge approach achieving more stable label alignment and higher semantic consistency, while the post-hoc approach yields stronger lexical similarity. These findings indicate that modeling explanations as expressions of fine-grained perspective provides a richer and more faithful representation of disagreement. The proposed approaches advance perspectivist modeling by integrating annotator-specific rationales into both predictive and generative components.

Task-specific Subnetwork Discovery in Reinforcement Learning for Autonomous Underwater Navigation

2604.21640v1 by Yi-Ling Liu, Melvin Laux, Mariela De Lucas Alvarez, Frank Kirchner, Rebecca Adam

Autonomous underwater vehicles are required to perform multiple tasks adaptively and in an explainable manner under dynamic, uncertain conditions and limited sensing, challenges that classical controllers struggle to address. This demands robust, generalizable, and inherently interpretable control policies for reliable long-term monitoring. Reinforcement learning, particularly multi-task RL, overcomes these limitations by leveraging shared representations to enable efficient adaptation across tasks and environments. However, while such policies show promising results in simulation and controlled experiments, they remain opaque and offer limited insight into the agent's internal decision-making, creating gaps in transparency, trust, and safety that hinder real-world deployment. The internal policy structure and task-specific specialization remain poorly understood. To address these gaps, we analyze the internal structure of a pretrained multi-task reinforcement learning network in the HoloOcean simulator for underwater navigation by identifying and comparing task-specific subnetworks responsible for navigating toward different species. We find that in a contextual multi-task reinforcement learning setting with related tasks, the network uses only about 1.5% of its weights to differentiate between tasks. Of these, approximately 85% connect the context-variable nodes in the input layer to the next hidden layer, highlighting the importance of context variables in such settings. Our approach provides insights into shared and specialized network components, useful for efficient model editing, transfer learning, and continual learning for underwater monitoring through a contextual multi-task reinforcement learning method.

On the Role of Preprocessing and Memristor Dynamics in Reservoir Computing for Image Classification

2604.21602v1 by Rishona Daniels, Duna Wattad, Ronny Ronen, David Saad, Shahar Kvatinsky

Reservoir computing (RC) is an emerging recurrent neural network architecture that has attracted growing attention for its low training cost and modest hardware requirements. Memristor-based circuits are particularly promising for RC, as their intrinsic dynamics can reduce network size and parameter overhead in tasks such as time-series prediction and image recognition. Although RC has been demonstrated with several memristive devices, a comprehensive evaluation of device-level requirements remains limited. In this paper, we analyze and explain the operation of a parallel delayed feedback network (PDFN) RC architecture with volatile memristors, focusing on how device characteristics -- such as decay rate, quantization, and variability -- affect reservoir performance. We further discuss strategies to improve data representation in the reservoir using preprocessing methods and suggest potential improvements. The proposed approach achieves 95.89% classification accuracy on MNIST, comparable with the best reported memristor-based RC implementations. Furthermore, the method maintains high robustness under 20% device variability, achieving an accuracy of up to 94.2%. These results demonstrate that volatile memristors can support reliable spatio-temporal information processing and reinforce their potential as key building blocks for compact, high-speed, and energy-efficient neuromorphic computing systems.

Dynamical Priors as a Training Objective in Reinforcement Learning

2604.21464v1 by Sukesh Subaharan

Standard reinforcement learning (RL) optimizes policies for reward but imposes few constraints on how decisions evolve over time. As a result, policies may achieve high performance while exhibiting temporally incoherent behavior such as abrupt confidence shifts, oscillations, or degenerate inactivity. We introduce Dynamical Prior Reinforcement Learning (DP-RL), a training framework that augments policy gradient learning with an auxiliary loss derived from external state dynamics that implement evidence accumulation and hysteresis. Without modifying the reward, environment, or policy architecture, this prior shapes the temporal evolution of action probabilities during learning. Across three minimal environments, we show that dynamical priors systematically alter decision trajectories in task-dependent ways, promoting temporally structured behavior that cannot be explained by generic smoothing. These results demonstrate that training objectives alone can control the temporal geometry of decision-making in RL agents.

Trustworthy Clinical Decision Support Using Meta-Predicates and Domain-Specific Languages

2604.21263v1 by Michael Bouzinier, Sergey Trifonov, Michael Chumack, Eugenia Lvova, Dmitry Etin

Background: Regulatory frameworks for AI in healthcare, including the EU AI Act and FDA guidance on AI/ML-based medical devices, require clinical decision support to demonstrate not only accuracy but auditability. Existing formal languages for clinical logic validate syntactic and structural correctness but not whether decision rules use epistemologically appropriate evidence. Methods: Drawing on design-by-contract principles, we introduce meta-predicates -- predicates about predicates -- for asserting epistemological constraints on clinical decision rules expressed in a DSL. An epistemological type system classifies annotations along four dimensions: purpose, knowledge domain, scale, and method of acquisition. Meta-predicates assert which evidence types are permissible in any given rule. The framework is instantiated in AnFiSA, an open-source platform for genetic variant curation, and demonstrated using the Brigham Genomics Medicine protocol on 5.6 million variants from the Genome in a Bottle benchmark. Results: Decision trees used in variant interpretation can be reformulated as unate cascades, enabling per-variant audit trails that identify which rule classified each variant and why. Meta-predicate validation catches epistemological errors before deployment, whether rules are human-written or AI-generated. The approach complements post-hoc methods such as LIME and SHAP: where explanation reveals what evidence was used after the fact, meta-predicates constrain what evidence may be used before deployment, while preserving human readability. Conclusions: Meta-predicate validation is a step toward demonstrating not only that decisions are accurate but that they rest on appropriate evidence in ways that can be independently audited. While demonstrated in genomics, the approach generalises to any domain requiring auditable decision logic.
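
To make the idea of a meta-predicate concrete, here is a minimal sketch in Python rather than in the paper's DSL. Every name in it is hypothetical (the `EvidenceType` record, the annotation catalogue, the rule names, and the `only_methods` helper); it only illustrates a predicate that inspects which evidence types a rule reads and reports the ones that are not permitted, before any rule is executed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvidenceType:
    # The four dimensions named in the abstract.
    purpose: str
    domain: str
    scale: str
    method: str  # method of acquisition

@dataclass(frozen=True)
class Rule:
    name: str
    fields: tuple  # annotation fields the rule reads

# Hypothetical annotation catalogue; field names and labels are illustrative only.
CATALOGUE = {
    "population_frequency": EvidenceType("filtering", "population genetics", "population", "measured"),
    "in_silico_score": EvidenceType("prioritisation", "protein function", "variant", "predicted"),
    "clinical_significance": EvidenceType("classification", "clinical", "variant", "curated"),
}

def only_methods(*allowed):
    # Meta-predicate factory: the returned check takes a rule (a predicate over
    # variants) and lists the fields whose acquisition method is not allowed.
    def check(rule):
        return [f for f in rule.fields if CATALOGUE[f].method not in allowed]
    return check

no_predicted_evidence = only_methods("measured", "curated")

rules = [
    Rule("rare_variant_filter", ("population_frequency",)),
    Rule("pathogenic_call", ("clinical_significance", "in_silico_score")),
]

# Validate all rules before deployment and keep only the offending ones.
violations = {r.name: no_predicted_evidence(r) for r in rules}
violations = {name: bad for name, bad in violations.items() if bad}
print(violations)
```

The check runs on the rule definitions alone, which is what distinguishes this kind of constraint from a post-hoc explanation computed on model outputs.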

Agentic AI for Personalized Physiotherapy: A Multi-Agent Framework for Generative Video Training and Real-Time Pose Correction

2604.21154v1 by Abhishek Dharmaratnakar, Srivaths Ranganathan, Anushree Sinha, Debanshu Das

At-home physiotherapy compliance remains critically low due to a lack of personalized supervision and dynamic feedback. Existing digital health solutions rely on static, pre-recorded video libraries or generic 3D avatars that fail to account for a patient's specific injury limitations or home environment. In this paper, we propose a novel Multi-Agent System (MAS) architecture that leverages Generative AI and computer vision to close the tele-rehabilitation loop. Our framework consists of four specialized micro-agents: a Clinical Extraction Agent that parses unstructured medical notes into kinematic constraints; a Video Synthesis Agent that utilizes foundational video generation models to create personalized, patient-specific exercise videos; a Vision Processing Agent for real-time pose estimation; and a Diagnostic Feedback Agent that issues corrective instructions. We present the system architecture, detail the prototype pipeline using Large Language Models and MediaPipe, and outline our clinical evaluation plan. This work demonstrates the feasibility of combining generative media with agentic autonomous decision-making to scale personalized patient care safely and effectively.

Propensity Inference: Environmental Contributors to LLM Behaviour

2604.21098v1 by Olli Järviniemi, Oliver Makins, Jacob Merizian, Robert Kirk, Ben Millwood

Motivated by loss of control risks from misaligned AI systems, we develop and apply methods for measuring language models' propensity for unsanctioned behaviour. We contribute three methodological improvements: analysing effects of changes to environmental factors on behaviour, quantifying effect sizes via Bayesian generalised linear models, and taking explicit measures against circular analysis. We apply the methodology to measure the effects of 12 environmental factors (6 strategic in nature, 6 non-strategic) and thus the extent to which behaviour is explained by strategic aspects of the environment, a question relevant to risks from misalignment. Across 23 language models and 11 evaluation environments, we find approximately equal contributions from strategic and non-strategic factors for explaining behaviour, do not find strategic factors becoming more or less influential as capabilities improve, and find some evidence for a trend for increased sensitivity to goal conflicts. Finally, we highlight a key direction for future propensity research: the development of theoretical frameworks and cognitive models of AI decision-making into empirically testable forms.

SGD at the Edge of Stability: The Stochastic Sharpness Gap

2604.21016v1 by Fangshuo Liao, Afroditi Kolomvaki, Anastasios Kyrillidis

When training neural networks with full-batch gradient descent (GD) and step size $\eta$, the largest eigenvalue of the Hessian -- the sharpness $S(\boldsymbol{\theta})$ -- rises to $2/\eta$ and hovers there, a phenomenon termed the Edge of Stability (EoS). Damian et al. (2023) showed that this behavior is explained by a self-stabilization mechanism driven by third-order structure of the loss, and that GD implicitly follows projected gradient descent (PGD) on the constraint $S(\boldsymbol{\theta}) \leq 2/\eta$. For mini-batch stochastic gradient descent (SGD), the sharpness stabilizes below $2/\eta$, with the gap widening as the batch size decreases; yet no theoretical explanation exists for this suppression. We introduce stochastic self-stabilization, extending the self-stabilization framework to SGD. Our key insight is that gradient noise injects variance into the oscillatory dynamics along the top Hessian eigenvector, strengthening the cubic sharpness-reducing force and shifting the equilibrium below $2/\eta$. Following the approach of Damian et al. (2023), we define stochastic predicted dynamics relative to a moving projected gradient descent trajectory and prove a stochastic coupling theorem that bounds the deviation of SGD from these predictions. We derive a closed-form equilibrium sharpness gap: $\Delta S = \eta\beta\sigma_{\boldsymbol{u}}^{2}/(4\alpha)$, where $\alpha$ is the progressive sharpening rate, $\beta$ is the self-stabilization strength, and $\sigma_{\boldsymbol{u}}^{2}$ is the gradient noise variance projected onto the top eigenvector. This formula predicts that smaller batch sizes yield flatter solutions and recovers GD when the batch equals the full dataset.
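
The closed-form gap can be evaluated numerically to see how it behaves with batch size. The sketch below plugs hypothetical values into the formula from the abstract. The way the projected noise variance depends on batch size is an assumption made here for illustration (the variance of a mini-batch mean drawn without replacement), not something the abstract specifies; it is chosen because it vanishes when the batch is the full dataset.

```python
def sharpness_gap(eta, alpha, beta, noise_var):
    # Equilibrium gap from the abstract: eta * beta * sigma_u^2 / (4 * alpha).
    return eta * beta * noise_var / (4.0 * alpha)

def minibatch_noise_var(per_sample_var, batch_size, dataset_size):
    # ASSUMPTION (not from the paper): variance of a mini-batch mean sampled
    # without replacement, which is zero when batch_size == dataset_size.
    return per_sample_var * (dataset_size - batch_size) / (batch_size * (dataset_size - 1))

# Hypothetical constants chosen only to make the trend visible.
eta, alpha, beta = 0.01, 0.5, 2.0
per_sample_var, dataset_size = 4.0, 1024

gaps = {}
for batch_size in (8, 64, 512, 1024):
    var = minibatch_noise_var(per_sample_var, batch_size, dataset_size)
    gaps[batch_size] = sharpness_gap(eta, alpha, beta, var)
    equilibrium = 2.0 / eta - gaps[batch_size]
    print(f"B={batch_size:5d}  gap={gaps[batch_size]:.6f}  equilibrium sharpness={equilibrium:.6f}")
```

Under these assumptions the gap shrinks monotonically as the batch grows and is exactly zero at the full batch, matching the two qualitative predictions stated in the abstract.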

Convergent Evolution: How Different Language Models Learn Similar Number Representations

2604.20817v1 by Deqing Fu, Tianyi Zhou, Mikhail Belkin, Vatsal Sharan, Robin Jia

Language models trained on natural text learn to represent numbers using periodic features with dominant periods at $T=2, 5, 10$. In this paper, we identify a two-tiered hierarchy of these features: while Transformers, Linear RNNs, LSTMs, and classical word embeddings trained in different ways all learn features that have period-$T$ spikes in the Fourier domain, only some learn geometrically separable features that can be used to linearly classify a number mod-$T$. To explain this incongruity, we prove that Fourier domain sparsity is necessary but not sufficient for mod-$T$ geometric separability. Empirically, we investigate when model training yields geometrically separable features, finding that the data, architecture, optimizer, and tokenizer all play key roles. In particular, we identify two different routes through which models can acquire geometrically separable features: they can learn them from complementary co-occurrence signals in general language data, including text-number co-occurrence and cross-number interaction, or from multi-token (but not single-token) addition problems. Overall, our results highlight the phenomenon of convergent evolution in feature learning: A diverse range of models learn similar features from different training signals.
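
The two properties contrasted above, a period-$T$ spike in the Fourier domain and linear separability of a number mod $T$, can be seen together in the simplest case. This sketch uses hand-built cosine and sine features, not embeddings from any trained model, so it shows only the case where both properties hold; the paper's point that a Fourier spike alone does not guarantee separability is not reproduced here.

```python
import numpy as np

def periodic_features(n, period):
    # Map each integer to a point on the unit circle with the given period.
    angle = 2.0 * np.pi * np.asarray(n) / period
    return np.stack([np.cos(angle), np.sin(angle)], axis=-1)

numbers = np.arange(1000)
T = 10
feats = periodic_features(numbers, T)            # shape (1000, 2)

# Fourier view: the cosine feature has a single dominant frequency bin.
spectrum = np.abs(np.fft.rfft(feats[:, 0]))
peak_bin = int(np.argmax(spectrum))              # bin index = len(numbers) / T

# Geometric view: a linear classifier with one weight vector per residue class.
prototypes = periodic_features(np.arange(T), T)  # shape (T, 2)
scores = feats @ prototypes.T                    # shape (1000, T)
predicted = scores.argmax(axis=1)
accuracy = float((predicted == numbers % T).mean())

print(peak_bin, accuracy)
```

With 1000 samples and period 10 the spike falls in bin 100, and each residue class sits at its own angle on the circle, so the dot-product classifier recovers every number mod 10.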

Automatic Ontology Construction Using LLMs as an External Layer of Memory, Verification, and Planning for Hybrid Intelligent Systems

2604.20795v1 by Pavel Salovskii, Iuliia Gorshkova

This paper presents a hybrid architecture for intelligent systems in which large language models (LLMs) are extended with an external ontological memory layer. Instead of relying solely on parametric knowledge and vector-based retrieval (RAG), the proposed approach constructs and maintains a structured knowledge graph using RDF/OWL representations, enabling persistent, verifiable, and semantically grounded reasoning. The core contribution is an automated pipeline for ontology construction from heterogeneous data sources, including documents, APIs, and dialogue logs. The system performs entity recognition, relation extraction, normalization, and triple generation, followed by validation using SHACL and OWL constraints, and continuous graph updates. During inference, LLMs operate over a combined context that integrates vector-based retrieval with graph-based reasoning and external tool interaction. Experimental observations on planning tasks, including the Tower of Hanoi benchmark, indicate that ontology augmentation improves performance in multi-step reasoning scenarios compared to baseline LLM systems. In addition, the ontology layer enables formal validation of generated outputs, transforming the system into a generation-verification-correction pipeline. The proposed architecture addresses key limitations of current LLM-based systems, including lack of long-term memory, weak structural understanding, and limited reasoning capabilities. It provides a foundation for building agent-based systems, robotics applications, and enterprise AI solutions that require persistent knowledge, explainability, and reliable decision-making.

Can "AI" Be a Doctor? A Study of Empathy, Readability, and Alignment in Clinical LLMs

2604.20791v1 by Mariano Barone, Francesco Di Serio, Roberto Moio, Marco Postiglione, Giuseppe Riccio, Antonio Romano, Vincenzo Moscato

Large Language Models (LLMs) are increasingly deployed in healthcare, yet their communicative alignment with clinical standards remains insufficiently quantified. We conduct a multidimensional evaluation of general-purpose and domain-specialized LLMs across structured medical explanations and real-world physician-patient interactions, analyzing semantic fidelity, readability, and affective resonance. Baseline models amplify affective polarity relative to physicians (Very Negative: 43.14-45.10% vs. 37.25%) and, in larger architectures such as GPT-5 and Claude, produce substantially higher linguistic complexity (FKGL up to 16.91-17.60 vs. 11.47-12.50 in physician-authored responses). Empathy-oriented prompting reduces extreme negativity and lowers grade-level complexity (up to -6.87 FKGL points for GPT-5) but does not significantly increase semantic fidelity. Collaborative rewriting yields the strongest overall alignment. Rephrase configurations achieve the highest semantic similarity to physician answers (up to mean = 0.93) while consistently improving readability and reducing affective extremity. Dual stakeholder evaluation shows that no model surpasses physicians on epistemic criteria, whereas patients consistently prefer rewritten variants for clarity and emotional tone. These findings suggest that LLMs function most effectively as collaborative communication enhancers rather than replacements for clinical expertise.
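
The FKGL figures above are Flesch-Kincaid Grade Level scores, defined as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below computes the score with a crude vowel-group syllable counter, which is an assumption made for brevity; the abstract does not say which tool the authors used, and dedicated readability libraries count syllables more carefully. The two example sentences are invented.

```python
import re

def count_syllables(word):
    # Rough heuristic: one syllable per run of vowels, at least one per word.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fkgl(text):
    # Flesch-Kincaid Grade Level:
    # 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

plain = "Take one pill each day. Call us if you feel sick."
dense = "Administration of the prescribed anticoagulant necessitates periodic haematological monitoring."

print(round(fkgl(plain), 2), round(fkgl(dense), 2))
```

Short sentences built from short words score near the bottom of the scale, while a single long sentence of polysyllabic terms scores far above the 11 to 13 range reported for physician-authored responses.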

Coverage, Not Averages: Semantic Stratification for Trustworthy Retrieval Evaluation

2604.20763v1 by Andrew Klearman, Radu Revutchi, Rohin Garg, Rishav Chakravarti, Samuel Marc Denton, Yuan Xue

Retrieval quality is the primary bottleneck for accuracy and robustness in retrieval-augmented generation (RAG). Current evaluation relies on heuristically constructed query sets, which introduce a hidden intrinsic bias. We formalize retrieval evaluation as a statistical estimation problem, showing that metric reliability is fundamentally limited by the evaluation-set construction. We further introduce semantic stratification, which grounds evaluation in corpus structure by organizing documents into an interpretable global space of entity-based clusters and systematically generating queries for missing strata. This yields (1) formal semantic coverage guarantees across retrieval regimes and (2) interpretable visibility into retrieval failure modes. Experiments across multiple benchmarks and retrieval methods validate our framework. The results expose systematic coverage gaps, identify structural signals that explain variance in retrieval performance, and show that stratified evaluation yields more stable and transparent assessments while supporting more trustworthy decision-making than aggregate metrics.
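
A small numeric example shows why an aggregate average can mislead when the query set does not follow the corpus structure. The sketch below is a generic stratified estimate, not the estimator from the paper; the cluster names, corpus shares, and hit counts are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical share of the corpus that falls in each entity-based cluster.
corpus_share = {"cardiology": 0.25, "oncology": 0.25, "neurology": 0.25, "dermatology": 0.25}

# Hypothetical evaluation results as (stratum, hit) pairs; the query set is
# heavily skewed toward one cluster and has no dermatology queries at all.
results = (
    [("cardiology", 1)] * 18 + [("cardiology", 0)] * 2
    + [("oncology", 1)] * 2 + [("oncology", 0)] * 2
    + [("neurology", 1)] * 1 + [("neurology", 0)] * 3
)

# Aggregate metric: plain mean over all queries.
aggregate = sum(hit for _, hit in results) / len(results)

# Per-stratum hit rates.
by_stratum = defaultdict(list)
for stratum, hit in results:
    by_stratum[stratum].append(hit)
per_stratum = {s: sum(h) / len(h) for s, h in by_stratum.items()}

# Strata with no queries are coverage gaps, reported rather than averaged away.
missing = sorted(s for s in corpus_share if s not in per_stratum)
covered_share = sum(corpus_share[s] for s in per_stratum)

# Stratified estimate over the covered part of the corpus, weighted by corpus share.
stratified = sum(corpus_share[s] * rate for s, rate in per_stratum.items()) / covered_share

print(f"aggregate={aggregate:.2f} stratified={stratified:.2f} "
      f"covered={covered_share:.2f} missing={missing}")
```

Here the aggregate hit rate is 0.75 because most queries come from the best-performing cluster, while the share-weighted estimate over covered strata is 0.55 and a quarter of the corpus has no queries at all.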

Participatory provenance as representational auditing for AI-mediated public consultation

2604.20711v1 by Sachit Mahajan

Artificial intelligence is increasingly deployed to synthesize large-scale public input in policy consultations and participatory processes. Yet no formal framework exists for auditing whether these summaries faithfully represent the source population, an accountability gap that existing approaches to AI explainability, grounding and hallucination detection do not address because they focus on output quality rather than input fidelity. Here, participatory provenance is introduced: a measurement framework grounded in optimal transport theory, causal inference and semantic analysis that tracks how individual public submissions are transformed, filtered or lost through AI-mediated summarization. Applied to Canada's 2025-2026 national AI Strategy consultation (n = 5,253 respondents across two independent policy topics), the framework reveals that both official government summaries underperform a random-participant baseline (-9.1% and -8.0% coverage degradation), with 16.9% and 15.3% of participants effectively excluded. Exclusion concentrates in clusters expressing dissent, scepticism and critique of AI (33-88% exclusion rates). Brevity, semantic isolation and rhetorical register independently predict representational outcome. An accompanying open-source interactive tool, the Co-creation Provenance Lab, enables policymakers to audit and iteratively improve summaries, establishing genuine human-in-the-loop oversight at scale.

RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking

2604.20623v1 by Roie Kazoom, Yotam Gigi, George Leifman, Tomer Shekel, Genady Beryozkin

Traditional change detection identifies where changes occur, but does not explain what changed in natural language. Existing remote sensing change captioning datasets typically describe overall image-level differences, leaving fine-grained localized semantic reasoning largely unexplored. To close this gap, we present RSRCC, a new benchmark for remote sensing change question-answering containing 126k questions, split into 87k training, 17.1k validation, and 22k test instances. Unlike prior datasets, RSRCC is built around localized, change-specific questions that require reasoning about a particular semantic change. To the best of our knowledge, this is the first remote sensing change question-answering benchmark designed explicitly for such fine-grained reasoning-based supervision. To construct RSRCC, we introduce a hierarchical semi-supervised curation pipeline that uses Best-of-N ranking as a critical final ambiguity-resolution stage. First, candidate change regions are extracted from semantic segmentation masks, then initially screened using an image-text embedding model, and finally validated through retrieval-augmented vision-language curation with Best-of-N ranking. This process enables scalable filtering of noisy and ambiguous candidates while preserving semantically meaningful changes. The dataset is available at https://huggingface.co/datasets/google/RSRCC.

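Summary: Traditional change detection identifies where changes occur but does not explain, in natural language, what changed. Existing remote sensing change captioning datasets typically describe overall image-level differences, leaving fine-grained, localized semantic reasoning largely unexplored. To fill this gap, we present RSRCC, a new remote sensing change question-answering benchmark containing 126,000 questions, split into 87,000 training, 17,100 validation, and 22,000 test instances. Unlike prior datasets, RSRCC is built around localized, change-specific questions that require reasoning about a particular semantic change. To the best of our knowledge, it is the first remote sensing change question-answering benchmark designed specifically for this kind of fine-grained, reasoning-based supervision. To construct RSRCC, we introduce a hierarchical semi-supervised curation pipeline that uses Best-of-N ranking as the critical final ambiguity-resolution stage. Candidate change regions are first extracted from semantic segmentation masks, then screened with an image-text embedding model, and finally validated through retrieval-augmented vision-language curation with Best-of-N ranking. This process enables scalable filtering of noisy and ambiguous candidates while preserving semantically meaningful changes. The dataset is available at https://huggingface.co/datasets/google/RSRCC.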
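
The Best-of-N ranking stage mentioned in the abstract above can be illustrated generically: produce N candidates, score each one, and keep the top-scoring candidate. The following is a minimal sketch of that idea only, not the RSRCC pipeline. The function names and the word-overlap scorer are hypothetical stand-ins for whatever retrieval-augmented vision-language scoring the authors actually use.

```python
from typing import Callable, List, Tuple


def best_of_n(candidates: List[str], score: Callable[[str], float]) -> Tuple[str, float]:
    """Return the highest-scoring candidate and its score (ties: earliest wins)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    scored = [(score(c), i, c) for i, c in enumerate(candidates)]
    # Sort key prefers a higher score, then a lower index to break ties.
    best_score, _, best = max(scored, key=lambda t: (t[0], -t[1]))
    return best, best_score


def overlap_scorer(reference: str) -> Callable[[str], float]:
    """Hypothetical scorer: fraction of candidate words found in a reference text."""
    ref_words = set(reference.lower().split())

    def score(candidate: str) -> float:
        words = set(candidate.lower().split())
        return len(words & ref_words) / len(words) if words else 0.0

    return score
```

In practice the scorer would be a learned model; the selection logic itself stays this simple.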

Evian: Towards Explainable Visual Instruction-tuning Data Auditing

2604.20544v1 by Zimu Jia, Mingjie Xu, Andrew Estornell, Jiaheng Wei

The efficacy of Large Vision-Language Models (LVLMs) is critically dependent on the quality of their training data, requiring a precise balance between visual fidelity and instruction-following capability. Existing datasets, however, are plagued by inconsistent quality, and current data filtering methods rely on coarse-grained scores that lack the granularity to identify nuanced semantic flaws like logical fallacies or factual errors. This creates a fundamental bottleneck in developing more reliable models. To address this, we make three core contributions. First, we construct a large-scale, 300K-sample benchmark by systematically injecting diverse, subtle defects to provide a challenging testbed for data auditing. Second, we introduce a novel "Decomposition-then-Evaluation" paradigm that breaks model responses into constituent cognitive components: visual description, subjective inference, and factual claim, enabling targeted analysis. Third, we instantiate this paradigm via EVIAN (Explainable Visual Instruction-tuning Data AuditiNg), an automated framework that evaluates these components along the orthogonal axes of Image-Text Consistency, Logical Coherence, and Factual Accuracy. Our empirical findings challenge the prevailing scale-centric paradigm: a model fine-tuned on a compact, high-quality subset curated by EVIAN consistently surpassed models trained on orders-of-magnitude larger datasets. We also reveal that dividing complex auditing into verifiable subtasks enables robust curation, and that Logical Coherence is the most critical factor in data quality evaluation.

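Summary: The effectiveness of Large Vision-Language Models (LVLMs) depends heavily on the quality of their training data, which requires a precise balance between visual fidelity and instruction-following capability. Existing datasets, however, suffer from inconsistent quality, and current data filtering methods rely on coarse-grained scores that lack the granularity to identify subtle semantic flaws such as logical fallacies or factual errors. This creates a fundamental bottleneck in developing more reliable models. To address it, we make three core contributions. First, we construct a large-scale, 300K-sample benchmark by systematically injecting diverse, subtle defects, providing a challenging testbed for data auditing. Second, we introduce a novel "Decomposition-then-Evaluation" paradigm that breaks model responses into their constituent cognitive components (visual description, subjective inference, and factual claim), enabling targeted analysis. Third, we instantiate this paradigm in EVIAN (Explainable Visual Instruction-tuning Data AuditiNg), an automated framework that evaluates these components along three orthogonal axes: Image-Text Consistency, Logical Coherence, and Factual Accuracy. Our empirical findings challenge the prevailing scale-centric paradigm: a model fine-tuned on a compact, high-quality subset curated by EVIAN consistently outperformed models trained on datasets orders of magnitude larger. We also show that dividing complex auditing into verifiable subtasks enables robust curation, and that Logical Coherence is the most critical factor in data quality evaluation.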

MedSkillAudit: A Domain-Specific Audit Framework for Medical Research Agent Skills

2604.20441v1 by Yingyong Hou, Xinyuan Lao, Huimei Wang, Qianyu Yao, Wei Chen, Bocheng Huang, Fei Sun, Yuxian Lv, Weiqi Lei, Xueqian Wen, Pengfei Xia, Zhujun Tan, Shengyang Xie

Background: Agent skills are increasingly deployed as modular, reusable capability units in AI agent systems. Medical research agent skills require safeguards beyond general-purpose evaluation, including scientific integrity, methodological validity, reproducibility, and boundary safety. This study developed and preliminarily evaluated a domain-specific audit framework for medical research agent skills, with a focus on reliability against expert review. Methods: We developed MedSkillAudit (skill-auditor@1.0), a layered framework assessing skill release readiness before deployment. We evaluated 75 skills across five medical research categories (15 per category). Two experts independently assigned a quality score (0-100), an ordinal release disposition (Production Ready / Limited Release / Beta Only / Reject), and a high-risk failure flag. System-expert agreement was quantified using ICC(2,1) and linearly weighted Cohen's kappa, benchmarked against the human inter-rater baseline. Results: The mean consensus quality score was 72.4 (SD = 13.0); 57.3% of skills fell below the Limited Release threshold. MedSkillAudit achieved ICC(2,1) = 0.449 (95% CI: 0.250-0.610), exceeding the human inter-rater ICC of 0.300. System-consensus score divergence (SD = 9.5) was smaller than inter-expert divergence (SD = 12.4), with no directional bias (Wilcoxon p = 0.613). Protocol Design showed the strongest category-level agreement (ICC = 0.551); Academic Writing showed a negative ICC (-0.567), reflecting a structural rubric-expert mismatch. Conclusions: Domain-specific pre-deployment audit may provide a practical foundation for governing medical research agent skills, complementing general-purpose quality checks with structured audit workflows tailored to scientific use cases.

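Summary: Background: Agent skills are increasingly deployed as modular, reusable capability units in AI agent systems. Medical research agent skills need safeguards beyond general-purpose evaluation, including scientific integrity, methodological validity, reproducibility, and boundary safety. This study developed and preliminarily evaluated a domain-specific audit framework for medical research agent skills, focusing on reliability against expert review. Methods: We developed MedSkillAudit (skill-auditor@1.0), a layered framework that assesses a skill's release readiness before deployment. We evaluated 75 skills across five medical research categories (15 per category). Two experts independently assigned a quality score (0-100), an ordinal release disposition (Production Ready / Limited Release / Beta Only / Reject), and a high-risk failure flag. System-expert agreement was quantified with ICC(2,1) and linearly weighted Cohen's kappa and benchmarked against the human inter-rater baseline. Results: The mean consensus quality score was 72.4 (SD = 13.0); 57.3% of skills fell below the Limited Release threshold. MedSkillAudit achieved ICC(2,1) = 0.449 (95% CI: 0.250-0.610), exceeding the human inter-rater ICC of 0.300. System-consensus score divergence (SD = 9.5) was smaller than inter-expert divergence (SD = 12.4), with no directional bias (Wilcoxon p = 0.613). Protocol Design showed the strongest category-level agreement (ICC = 0.551); Academic Writing showed a negative ICC (-0.567), reflecting a structural mismatch between the rubric and the experts. Conclusions: Domain-specific pre-deployment audit may provide a practical foundation for governing medical research agent skills, complementing general-purpose quality checks with structured audit workflows tailored to scientific use cases.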
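
For readers unfamiliar with the agreement statistic reported above, ICC(2,1) is the two-way random-effects, absolute-agreement, single-rater intraclass correlation, computed from the mean squares of a subjects-by-raters table. The sketch below uses the standard textbook formula in plain Python. It is an illustration of the statistic, not the authors' analysis code, and the function name is an assumption.

```python
from typing import List


def icc_2_1(ratings: List[List[float]]) -> float:
    """ICC(2,1) for a table with one row per subject and one column per rater."""
    n = len(ratings)        # number of rated subjects (here: skills)
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: subjects, raters, and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-subject mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Because the rater mean square enters the denominator, a systematic offset between raters lowers ICC(2,1) even when their rankings agree.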

Surrogate modeling for interpreting black-box LLMs in medical predictions

2604.20331v2 by Changho Han, Songsoo Kim, Dong Won Kim, Leo Anthony Celi, Jaewoong Kim, SungA Bae, Dukyong Yoon

Large language models (LLMs), trained on vast datasets, encode extensive real-world knowledge within their parameters, yet their black-box nature obscures the mechanisms and extent of this encoding. Surrogate modeling, which uses simplified models to approximate complex systems, can offer a path toward better interpretability of black-box models. We propose a surrogate modeling framework that quantitatively explains LLM-encoded knowledge. For a specific hypothesis derived from domain knowledge, this framework approximates the latent LLM knowledge space using observable elements (input-output pairs) through extensive prompting across a comprehensive range of simulated scenarios. Through proof-of-concept experiments in medical predictions, we demonstrate our framework's effectiveness in revealing the extent to which LLMs "perceive" each input variable in relation to the output. Particularly, given concerns that LLMs may perpetuate inaccuracies and societal biases embedded in their training data, our experiments using this framework quantitatively revealed both associations that contradict established medical knowledge and the persistence of scientifically refuted racial assumptions within LLM-encoded knowledge. By disclosing these issues, our framework can act as a red-flag indicator to support the safe and reliable application of these models.

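Summary: Large language models (LLMs), trained on vast datasets, encode extensive real-world knowledge in their parameters, yet their black-box nature obscures the mechanisms and extent of this encoding. Surrogate modeling, which uses simplified models to approximate complex systems, offers a path toward better interpretability of black-box models. We propose a surrogate modeling framework that quantitatively explains LLM-encoded knowledge. For a specific hypothesis derived from domain knowledge, the framework approximates the latent LLM knowledge space from observable elements (input-output pairs) through extensive prompting across a comprehensive range of simulated scenarios. Through proof-of-concept experiments in medical prediction, we demonstrate the framework's effectiveness in revealing the extent to which LLMs "perceive" each input variable in relation to the output. In particular, given concerns that LLMs may perpetuate inaccuracies and societal biases embedded in their training data, our experiments quantitatively revealed both associations that contradict established medical knowledge and the persistence of scientifically refuted racial assumptions within LLM-encoded knowledge. By disclosing these issues, the framework can act as a red-flag indicator that supports the safe and reliable application of these models.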
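
The surrogate idea described above (probe a black box across simulated scenarios, then summarize how each input relates to the output) can be sketched with a toy example. The code below is a simplified illustration under stated assumptions: inputs are binary, the black box is any callable (a hypothetical stand-in for prompting an LLM), and the surrogate is a set of main effects estimated over a full factorial grid. The paper's actual surrogate model is not specified in the abstract.

```python
from itertools import product
from typing import Callable, Dict, Sequence


def main_effects(
    black_box: Callable[[Dict[str, int]], float],
    variables: Sequence[str],
) -> Dict[str, float]:
    """Estimate each binary variable's average effect on the black-box output."""
    # Enumerate every combination of 0/1 values (full factorial design).
    scenarios = [dict(zip(variables, bits)) for bits in product([0, 1], repeat=len(variables))]
    outputs = [black_box(s) for s in scenarios]

    effects = {}
    for v in variables:
        on = [y for s, y in zip(scenarios, outputs) if s[v] == 1]
        off = [y for s, y in zip(scenarios, outputs) if s[v] == 0]
        # Mean output with the variable set minus mean output with it unset.
        effects[v] = sum(on) / len(on) - sum(off) / len(off)
    return effects
```

An effect whose sign contradicts established domain knowledge would be the kind of red flag the abstract describes.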

Stateless Decision Memory for Enterprise AI Agents

2604.20158v1 by Vasundra Srinivasan

Enterprise deployment of long-horizon decision agents in regulated domains (underwriting, claims adjudication, tax examination) is dominated by retrieval-augmented pipelines despite a decade of increasingly sophisticated stateful memory architectures. We argue this reflects a hidden requirement: regulated deployment is load-bearing on four systems properties (deterministic replay, auditable rationale, multi-tenant isolation, statelessness for horizontal scale), and stateful architectures violate them by construction. We propose Deterministic Projection Memory (DPM): an append-only event log plus one task-conditioned projection at decision time. On ten regulated decisioning cases at three memory budgets, DPM matches summarization-based memory at generous budgets and substantially outperforms it when the budget binds: at a 20x compression ratio, DPM improves factual precision by +0.52 (Cohen's h=1.17, p=0.0014) and reasoning coherence by +0.53 (h=1.13, p=0.0034), paired permutation, n=10. DPM is additionally 7-15x faster at binding budgets, making one LLM call at decision time instead of N. A determinism study of 10 replays per case at temperature zero shows both architectures inherit residual API-level nondeterminism, but the asymmetry is structural: DPM exposes one nondeterministic call; summarization exposes N compounding calls. The audit surface follows the same one-versus-N pattern: DPM logs two LLM calls per decision while summarization logs 83-97 on LongHorizon-Bench. We conclude with TAMS, a practitioner heuristic for architecture selection, and a failure analysis of stateful memory under enterprise operating conditions. The contribution is the argument that statelessness is the load-bearing property explaining enterprise's preference for weaker but replayable retrieval pipelines, and that DPM demonstrates this property is attainable without the decisioning penalty retrieval pays.

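Summary: Enterprise deployment of long-horizon decision agents in regulated domains (underwriting, claims adjudication, tax examination) is dominated by retrieval-augmented pipelines, despite a decade of increasingly sophisticated stateful memory architectures. We argue that this reflects a hidden requirement: regulated deployment depends on four system properties (deterministic replay, auditable rationale, multi-tenant isolation, and statelessness for horizontal scale), and stateful architectures violate them by construction. We propose Deterministic Projection Memory (DPM): an append-only event log plus one task-conditioned projection at decision time. On ten regulated decisioning cases at three memory budgets, DPM matches summarization-based memory at generous budgets and substantially outperforms it when the budget binds: at a 20x compression ratio, DPM improves factual precision by +0.52 (Cohen's h=1.17, p=0.0014) and reasoning coherence by +0.53 (h=1.13, p=0.0034), paired permutation, n=10. DPM is also 7-15x faster at binding budgets, making one LLM call at decision time instead of N. A determinism study of 10 replays per case at temperature zero shows that both architectures inherit residual API-level nondeterminism, but the asymmetry is structural: DPM exposes one nondeterministic call, while summarization exposes N compounding calls. The audit surface follows the same one-versus-N pattern: DPM logs two LLM calls per decision, while summarization logs 83-97 on LongHorizon-Bench. We conclude with TAMS, a practitioner heuristic for architecture selection, and a failure analysis of stateful memory under enterprise operating conditions. The contribution is the argument that statelessness is the load-bearing property explaining enterprises' preference for weaker but replayable retrieval pipelines, and that DPM shows this property is attainable without the decisioning penalty that retrieval pays.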
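
The structure described above, an append-only event log plus a single projection computed at decision time, can be sketched as a small data structure. This is a minimal illustration under assumptions, not the paper's implementation: the class and function names are hypothetical, and the projection here is a deterministic filter, whereas the abstract indicates the real projection involves an LLM call.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Event:
    seq: int    # position in the log, assigned at append time
    kind: str   # hypothetical event category, e.g. "income"
    text: str   # event payload


@dataclass
class EventLog:
    _events: List[Event] = field(default_factory=list)

    def append(self, kind: str, text: str) -> Event:
        """Add an event; existing events are never modified or removed."""
        event = Event(seq=len(self._events), kind=kind, text=text)
        self._events.append(event)
        return event

    def snapshot(self) -> Tuple[Event, ...]:
        """Immutable copy of the log, suitable for replay."""
        return tuple(self._events)


def project(events: Iterable[Event], task_kinds: Set[str], budget_chars: int) -> List[Event]:
    """Stand-in projection: keep task-relevant events, in order, within a size budget."""
    selected: List[Event] = []
    used = 0
    for event in events:
        if event.kind in task_kinds and used + len(event.text) <= budget_chars:
            selected.append(event)
            used += len(event.text)
    return selected
```

Because no state is carried between decisions, replaying the same snapshot with the same task and budget yields the same view, which is the replay property the abstract emphasizes.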

From Fuzzy to Formal: Scaling Hospital Quality Improvement with AI

2604.20055v1 by Patrick Vossler, Jean Feng, Venkat Sivaraman, Robert Gallo, Hemal Kanzaria, Dana Freiser, Christopher Ross, Amy Ou, James Marks, Susan Ehrlich, Christopher Peabody, Lucas Zier

Hospital Quality Improvement (QI) plays a critical role in optimizing healthcare delivery by translating high-level hospital goals into actionable solutions. A critical step of QI is to identify the key modifiable contributing factors, a process we call QI factor discovery, typically through expert-driven semi-structured qualitative tools like fishbone diagrams, chart reviews, and Lean Healthcare methods. AI has the potential to transform and accelerate QI factor discovery, which is traditionally time- and resource-intensive and limited in reproducibility and auditability. Nevertheless, current AI alignment methods assume the task is well-defined, whereas QI factor discovery is an exploratory, fuzzy, and iterative sense-making process that relies on complex implicit expert judgments. To design an AI pipeline that formalizes the QI process while preserving its exploratory components, we propose viewing the task as learning not only LLM prompts but also the overarching natural-language specifications. In particular, we map QI factor discovery to steps of the classical AI/ML development process (problem formalization, model learning, and model validation) where the specifications are tunable hyperparameters. Domain experts and AI agents iteratively refine both the overarching specifications and AI pipeline until AI extractions are concordant with expert annotations and aligned with clinical objectives. We applied this "Human-AI Spec-Solution Co-optimization" framework at an urban safety-net hospital to identify factors driving prolonged length of stay and unplanned 30-day readmissions. The resulting AI-for-QI pipelines achieved $\ge 70\%$ concordance with expert annotations. Compared to prior manual Lean analyses, the AI pipeline was substantially more efficient, recovered previous findings, surfaced new modifiable factors, and produced auditable reasoning traces.

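Summary: Hospital Quality Improvement (QI) plays a critical role in optimizing healthcare delivery by translating high-level hospital goals into actionable solutions. A key step in QI is identifying the main modifiable contributing factors, a process we call QI factor discovery, which is typically carried out with expert-driven, semi-structured qualitative tools such as fishbone diagrams, chart reviews, and Lean Healthcare methods. AI has the potential to transform and accelerate QI factor discovery, which is traditionally time- and resource-intensive and limited in reproducibility and auditability. However, current AI alignment methods assume the task is well defined, whereas QI factor discovery is an exploratory, fuzzy, and iterative sense-making process that relies on complex implicit expert judgments. To design an AI pipeline that formalizes the QI process while preserving its exploratory components, we propose viewing the task as learning not only the LLM prompts but also the overarching natural-language specifications. Specifically, we map QI factor discovery onto the steps of the classical AI/ML development process (problem formalization, model learning, and model validation), with the specifications acting as tunable hyperparameters. Domain experts and AI agents iteratively refine both the overarching specifications and the AI pipeline until the AI extractions are concordant with expert annotations and aligned with clinical objectives. We applied this "Human-AI Spec-Solution Co-optimization" framework at an urban safety-net hospital to identify factors driving prolonged length of stay and unplanned 30-day readmissions. The resulting AI-for-QI pipelines achieved $\ge 70\%$ concordance with expert annotations. Compared with prior manual Lean analyses, the AI pipeline was substantially more efficient, recovered previous findings, surfaced new modifiable factors, and produced auditable reasoning traces.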

TriEx: A Game-based Tri-View Framework for Explaining Internal Reasoning in Multi-Agent LLMs

2604.20043v1 by Ziyi Wang, Chen Zhang, Wenjun Peng, Qi Wu, Xinyu Wang

Explainability for Large Language Model (LLM) agents is especially challenging in interactive, partially observable settings, where decisions depend on evolving beliefs and other agents. We present TriEx, a tri-view explainability framework that instruments sequential decision making with aligned artifacts: (i) structured first-person self-reasoning bound to an action, (ii) explicit second-person belief states about opponents updated over time, and (iii) third-person oracle audits grounded in environment-derived reference signals. This design turns explanations from free-form narratives into evidence-anchored objects that can be compared and checked across time and perspectives. Using imperfect-information strategic games as a controlled testbed, we show that TriEx enables scalable analysis of explanation faithfulness, belief dynamics, and evaluator reliability, revealing systematic mismatches between what agents say, what they believe, and what they do. Our results highlight explainability as an interaction-dependent property and motivate multi-view, evidence-grounded evaluation for LLM agents. Code is available at https://github.com/Einsam1819/TriEx.

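Summary: Explainability for Large Language Model (LLM) agents is especially challenging in interactive, partially observable settings, because decisions depend on evolving beliefs and on other agents. We present TriEx, a tri-view explainability framework that equips sequential decision making with aligned artifacts: (i) structured first-person self-reasoning tied to an action, (ii) explicit second-person belief states about opponents that are updated over time, and (iii) third-person oracle audits grounded in environment-derived reference signals. This design turns explanations from free-form narratives into evidence-anchored objects that can be compared and checked across time and perspectives. Using imperfect-information strategic games as a controlled testbed, we show that TriEx enables scalable analysis of explanation faithfulness, belief dynamics, and evaluator reliability, revealing systematic mismatches between what agents say, what they believe, and what they do. Our results highlight explainability as an interaction-dependent property and motivate multi-view, evidence-grounded evaluation of LLM agents. Code is available at https://github.com/Einsam1819/TriEx.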

Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models

2604.19598v2 by Kihyuk Lee

This study compared repeated generation consistency of exercise prescription outputs across three large language models (LLMs), specifically GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, under temperature=0 conditions. Each model generated prescriptions for six clinical scenarios 20 times, yielding 360 total outputs analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Mean semantic similarity was highest for GPT-4.1 (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences confirmed (H = 458.41, p < .001). Critically, these scores reflected fundamentally different generative behaviors: GPT-4.1 produced entirely unique outputs (100%) with stable semantic content, while Gemini 2.5 Flash showed pronounced output repetition (27.5% unique outputs), indicating that its high similarity score derived from text duplication rather than consistent reasoning. Identical decoding settings thus yielded fundamentally different consistency profiles, a distinction that single-output evaluations cannot capture. Safety expression reached ceiling levels across all models, confirming its limited utility as a differentiating metric. These results indicate that model selection constitutes a clinical rather than merely technical decision, and that output behavior under repeated generation conditions should be treated as a core criterion for reliable deployment of LLM-based exercise prescription systems.

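Summary: This study compared the repeated-generation consistency of exercise prescription outputs across three large language models (LLMs), namely GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, under temperature=0 conditions. Each model generated prescriptions for six clinical scenarios 20 times, yielding 360 outputs in total, which were analyzed along four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Mean semantic similarity was highest for GPT-4.1 (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences confirmed (H = 458.41, p < .001). Critically, these scores reflected fundamentally different generative behaviors: GPT-4.1 produced entirely unique outputs (100%) with stable semantic content, whereas Gemini 2.5 Flash showed pronounced output repetition (27.5% unique outputs), indicating that its high similarity score came from text duplication rather than consistent reasoning. Identical decoding settings thus produced fundamentally different consistency profiles, a distinction that single-output evaluations cannot capture. Safety expression reached ceiling levels across all models, confirming its limited utility as a differentiating metric. These results indicate that model selection is a clinical decision rather than a merely technical one, and that output behavior under repeated generation should be treated as a core criterion for the reliable deployment of LLM-based exercise prescription systems.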
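
The distinction drawn above, between high similarity that comes from verbatim repetition and high similarity that comes from stable content in varied wording, can be made concrete with two simple measurements. The sketch below is illustrative only. The function names are hypothetical, and word-level Jaccard overlap is a crude stand-in for the semantic similarity measure used in the study, which the abstract does not specify.

```python
from itertools import combinations
from typing import List


def unique_output_rate(outputs: List[str]) -> float:
    """Fraction of generations that are textually distinct."""
    return len(set(outputs)) / len(outputs)


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two texts (stand-in for semantic similarity)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def mean_pairwise_similarity(outputs: List[str]) -> float:
    """Average similarity over all unordered pairs (needs at least two outputs)."""
    pairs = list(combinations(outputs, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Two sets of outputs can share the same mean similarity while differing sharply in unique-output rate, which is why the abstract argues that both need to be reported.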

Integrating Anomaly Detection into Agentic AI for Proactive Risk Management in Human Activity

2604.19538v1 by Farbod Zorriassatine, Ahmad Lotfi

Agentic AI, with goal-directed, proactive, and autonomous decision-making capabilities, offers a compelling opportunity to address movement-related risks in human activity, including the persistent hazard of falls among elderly populations. Despite numerous approaches to fall mitigation through fall prediction and detection, existing systems have not yet functioned as universal solutions across care pathways and safety-critical environments. This is largely due to limitations in consistently handling real-world complexity, particularly poor context awareness, high false alarm rates, environmental noise, and data scarcity. We argue that fall detection and fall prediction can usefully be formulated as anomaly detection problems and more effectively addressed through an agentic AI system. More broadly, this perspective enables the early identification of subtle deviations in movement patterns associated with increased risk, whether arising from age-related decline, fatigue, or environmental factors. While technical requirements for immediate deployment are beyond the scope of this paper, we propose a conceptual framework that highlights potential value. This framework promotes a well-orchestrated approach to risk management by dynamically selecting relevant tools and integrating them into adaptive decision-making workflows, rather than relying on static configurations tailored to narrowly defined scenarios.

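Summary: Agentic AI, with its goal-directed, proactive, and autonomous decision-making capabilities, offers a compelling opportunity to address movement-related risks in human activity, including the persistent hazard of falls among elderly populations. Despite numerous approaches to fall mitigation through fall prediction and detection, existing systems have not yet served as universal solutions across care pathways and safety-critical environments. This is largely due to limitations in consistently handling real-world complexity, particularly poor context awareness, high false-alarm rates, environmental noise, and data scarcity. We argue that fall detection and fall prediction can usefully be formulated as anomaly detection problems and addressed more effectively through an agentic AI system. More broadly, this perspective enables early identification of subtle deviations in movement patterns associated with increased risk, whether they arise from age-related decline, fatigue, or environmental factors. Although the technical requirements for immediate deployment are beyond the scope of this paper, we propose a conceptual framework that highlights the potential value. The framework promotes a well-orchestrated approach to risk management by dynamically selecting relevant tools and integrating them into adaptive decision-making workflows, rather than relying on static configurations tailored to narrowly defined scenarios.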

EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

2604.19485v1 by Chengjun Pan, Shichun Liu, Jiahang Lin, Dingwei Zhu, Jiazheng Zhang, Shihan Dou, Songyang Gao, Zhenhua Han, Binghai Wang, Rui Zheng, Xuanjing Huang, Tao Gui, Yansong Feng

Reinforcement learning (RL) for LLM post-training faces a fundamental design choice: whether to use a learned critic as a baseline for policy optimization. Classical theory favors critic-based methods such as PPO for variance reduction, yet critic-free alternatives like GRPO have gained widespread adoption due to their simplicity and competitive performance. We show that in sparse-reward settings, a learned critic can inject estimation noise that exceeds the state signal it captures, increasing rather than reducing advantage variance. By casting baseline selection as a Kalman filtering problem, we unify PPO and GRPO as two extremes of the Kalman gain and prove that explained variance (EV), computable from a single training batch, identifies the exact boundary: positive EV indicates the critic reduces variance, while zero or negative EV signals that it inflates variance. Building on this insight, we propose Explained Variance Policy Optimization (EVPO), which monitors batch-level EV at each training step and adaptively switches between critic-based and batch-mean advantage estimation, provably achieving no greater variance than the better of the two at every step. Across four tasks spanning classical control, agentic interaction, and mathematical reasoning, EVPO consistently outperforms both PPO and GRPO regardless of which fixed baseline is stronger on a given task. Further analysis confirms that the adaptive gating tracks critic maturation over training and that the theoretically derived zero threshold is empirically optimal.

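Summary: Reinforcement learning (RL) for LLM post-training faces a fundamental design choice: whether to use a learned critic as the baseline for policy optimization. Classical theory favors critic-based methods such as PPO for variance reduction, yet critic-free alternatives such as GRPO have been widely adopted because of their simplicity and competitive performance. We show that in sparse-reward settings, a learned critic can inject estimation noise that exceeds the state signal it captures, increasing rather than reducing advantage variance. By casting baseline selection as a Kalman filtering problem, we unify PPO and GRPO as two extremes of the Kalman gain and prove that explained variance (EV), which can be computed from a single training batch, identifies the exact boundary: positive EV indicates that the critic reduces variance, while zero or negative EV signals that it inflates variance. Building on this insight, we propose Explained Variance Policy Optimization (EVPO), which monitors batch-level EV at each training step and adaptively switches between critic-based and batch-mean advantage estimation, provably achieving no greater variance than the better of the two at every step. Across four tasks spanning classical control, agentic interaction, and mathematical reasoning, EVPO consistently outperforms both PPO and GRPO, regardless of which fixed baseline is stronger on a given task. Further analysis confirms that the adaptive gating tracks critic maturation over training and that the theoretically derived zero threshold is empirically optimal.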
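
The gating rule described above (use the critic as the baseline when batch-level explained variance is positive, otherwise fall back to the batch mean) can be sketched in a few lines. This is a minimal illustration, not the EVPO implementation. It assumes the commonly used definition EV = 1 - Var(returns - values) / Var(returns), which the abstract does not spell out, and it treats a zero-variance batch as EV = 0, which is a choice made here for the sketch.

```python
from statistics import fmean, pvariance
from typing import List, Sequence


def explained_variance(returns: Sequence[float], values: Sequence[float]) -> float:
    """EV of critic predictions on one batch (assumed standard definition)."""
    var_returns = pvariance(returns)
    if var_returns == 0:
        return 0.0  # assumption: a constant-return batch gives the critic no credit
    residuals = [r - v for r, v in zip(returns, values)]
    return 1.0 - pvariance(residuals) / var_returns


def gated_advantages(returns: Sequence[float], values: Sequence[float]) -> List[float]:
    """Critic baseline if EV > 0, batch-mean baseline otherwise."""
    if explained_variance(returns, values) > 0:
        return [r - v for r, v in zip(returns, values)]
    mean_return = fmean(returns)
    return [r - mean_return for r in returns]
```

With a critic that is anti-correlated with the returns, EV is negative and the function silently switches to the batch-mean baseline.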

Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

2604.19457v1 by Vasundra Srinivasan

Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi-step reasoning, and binding regulatory constraints. Current evaluation reports a single task-success scalar that conflates distinct failure modes and hides whether an agent is aligned with the standards its deployment environment requires. We propose that long-horizon decision behavior decomposes into four orthogonal alignment axes, each independently measurable and failable: factual precision (FRP), reasoning coherence (RCS), compliance reconstruction (CRR), and calibrated abstention (CAR). CRR is a novel regulatory-grounded axis; CAR is a measurement axis separating coverage from accuracy. We exercise the decomposition on a controlled benchmark (LongHorizon-Bench) covering loan qualification and insurance claims adjudication with deterministic ground-truth construction. Running six memory architectures, we find structure aggregate accuracy cannot see: retrieval collapses on factual precision; schema-anchored architectures pay a scaffolding tax; plain summarization under a fact-preservation prompt is a strong baseline on FRP, RCS, EDA, and CRR; and all six architectures commit on every case, exposing a decisional-alignment axis the field has not targeted. The decomposition also surfaced a pre-registered prediction of our own, that summarization would fail factual recall, which the data reversed at large magnitude, an axis-level reversal aggregate accuracy would have hidden. Institutional alignment (regulatory reconstruction) and decisional alignment (calibrated abstention) are under-represented in the alignment literature and become load-bearing once decisions leave the laboratory. The framework transfers to any regulated decisioning domain via two steps: build a fact schema, and calibrate the CRR auditor prompt.

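Summary: Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi-step reasoning, and binding regulatory constraints. Current evaluation reports a single task-success scalar that conflates distinct failure modes and hides whether an agent meets the standards its deployment environment requires. We propose that long-horizon decision behavior decomposes into four orthogonal alignment axes, each independently measurable and able to fail on its own: factual precision (FRP), reasoning coherence (RCS), compliance reconstruction (CRR), and calibrated abstention (CAR). CRR is a novel regulation-grounded axis; CAR is a measurement axis that separates coverage from accuracy. We exercise the decomposition on a controlled benchmark (LongHorizon-Bench) covering loan qualification and insurance claims adjudication, with deterministic ground-truth construction. Running six memory architectures, we find structure that aggregate accuracy cannot see: retrieval collapses on factual precision; schema-anchored architectures pay a scaffolding tax; plain summarization under a fact-preservation prompt is a strong baseline on FRP, RCS, EDA, and CRR; and all six architectures commit to a decision on every case, exposing a decisional-alignment axis the field has not targeted. The decomposition also surfaced a pre-registered prediction of our own, that summarization would fail at factual recall, which the data reversed by a large margin, an axis-level reversal that aggregate accuracy would have hidden. Institutional alignment (regulatory reconstruction) and decisional alignment (calibrated abstention) are under-represented in the alignment literature and become load-bearing once decisions leave the laboratory. The framework transfers to any regulated decisioning domain in two steps: build a fact schema, and calibrate the CRR auditor prompt.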
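
The abstract above describes calibrated abstention as an axis that separates coverage from accuracy. A generic way to see that separation is to compute the two quantities independently: coverage is the share of cases on which the agent commits to a decision, and selective accuracy is the accuracy on those committed cases only. The sketch below shows this standard selective-prediction bookkeeping. It is not the paper's CAR metric, whose definition the abstract does not give, and the function name and use of None to mark abstention are assumptions.

```python
from typing import List, Optional, Tuple


def coverage_and_selective_accuracy(
    predictions: List[Optional[str]],
    truths: List[str],
) -> Tuple[float, Optional[float]]:
    """Return (coverage, accuracy on committed cases); None marks an abstention."""
    committed = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    coverage = len(committed) / len(truths)
    if not committed:
        return coverage, None  # accuracy is undefined when the agent never commits
    correct = sum(1 for p, t in committed if p == t)
    return coverage, correct / len(committed)
```

An agent that commits on every case, as the abstract reports for all six architectures, always has coverage 1.0, so a single accuracy number cannot show whether it should have abstained on some cases.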

TACENR: Task-Agnostic Contrastive Explanations for Node Representations

2604.19372v1 by Vasiliki Papanikou, Evaggelia Pitoura

Graph representation learning has achieved notable success in encoding graph-structured data into latent vector spaces, enabling a wide range of downstream tasks. However, these node representations remain opaque and difficult to interpret. Existing explainability methods primarily focus on supervised settings or on explaining individual representation dimensions, leaving a critical gap in explaining the overall structure of node representations. In this paper, we propose TACENR (Task-Agnostic Contrastive Explanations for Node Representations), a local explanation method that identifies not only attribute features but also proximity and structural ones that contribute the most in the representation space. TACENR builds on contrastive learning, through which we learn a similarity function in the representation space, revealing which are the features that play an important role in the representation of a node. While our focus is on task-agnostic explanations, TACENR can be applied to supervised scenarios as well. Experimental results demonstrate that proximity and structural features play a significant role in shaping node representations and that our supervised variant performs comparably to existing task-specific approaches in identifying the most impactful features.

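Summary: Graph representation learning has achieved notable success in encoding graph-structured data into latent vector spaces, enabling a wide range of downstream tasks. However, these node representations remain opaque and difficult to interpret. Existing explainability methods focus mainly on supervised settings or on explaining individual representation dimensions, leaving a critical gap in explaining the overall structure of node representations. In this paper, we propose TACENR (Task-Agnostic Contrastive Explanations for Node Representations), a local explanation method that identifies not only attribute features but also the proximity and structural features that contribute most in the representation space. TACENR builds on contrastive learning, through which we learn a similarity function in the representation space that reveals which features play an important role in a node's representation. Although our focus is on task-agnostic explanations, TACENR can also be applied to supervised scenarios. Experimental results show that proximity and structural features play a significant role in shaping node representations, and that our supervised variant performs comparably to existing task-specific approaches in identifying the most impactful features.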

Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications

2604.19281v1 by Abu Noman Md Sakib, Md. Main Oddin Chisty, Zijie Zhang

The use of Large Language Models (LLMs) to support patients in addressing medical questions is becoming increasingly prevalent. However, most of the measures currently used to evaluate the performance of these models in this context only measure how closely a model's answers match semantically, and therefore do not provide a true indication of the model's medical accuracy or of the health equity risks associated with it. To address these shortcomings, we present a new evaluation framework for medical question answering called VB-Score (Verification-Based Score) that provides a separate evaluation of the four components of entity recognition, semantic similarity, factual consistency, and structured information completeness for medical question-answering models. We perform rigorous reviews of the performance of three well-known and widely used LLMs on 48 public health-related topics taken from high-quality, authoritative information sources. Based on our analyses, we discover a major discrepancy between the models' semantic and entity accuracy. Our assessments of the performance of all three models show that each of them has almost uniformly severe performance failures when evaluated against our criteria. Our findings indicate alarming performance disparities across various public health topics, with most of the models exhibiting 13.8% lower performance (compared to an overall average) for all the public health topics that relate to chronic conditions that occur in older and minority populations, which indicates the existence of what's known as condition-based algorithmic discrimination. Our findings also demonstrate that prompt engineering alone does not compensate for basic architectural limitations on how these models perform in extracting medical entities and raise the question of whether semantic evaluation alone is a sufficient measure of medical AI safety.

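Summary: The use of Large Language Models (LLMs) to help patients with medical questions is becoming increasingly common. However, most measures currently used to evaluate these models in this setting only assess how closely a model's answers match semantically, and therefore do not truly reflect the model's medical accuracy or the health equity risks associated with it. To address these shortcomings, we present a new evaluation framework for medical question answering, VB-Score (Verification-Based Score), which separately evaluates four components of medical question-answering models: entity recognition, semantic similarity, factual consistency, and structured information completeness. We rigorously review the performance of three well-known, widely used LLMs on 48 public health topics drawn from high-quality, authoritative information sources. Our analyses reveal a major discrepancy between the models' semantic accuracy and their entity accuracy. Our assessment of all three models shows that each exhibits almost uniformly severe performance failures when evaluated against our criteria. The findings indicate alarming performance disparities across public health topics: for all topics relating to chronic conditions that occur in older and minority populations, most of the models perform 13.8% below the overall average, which indicates what is known as condition-based algorithmic discrimination. Our findings also show that prompt engineering alone does not compensate for basic architectural limitations in how these models extract medical entities, and they raise the question of whether semantic evaluation alone is a sufficient measure of medical AI safety.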

Gradient-Based Program Synthesis with Neurally Interpreted Languages

2604.18907v1 by Matthew V. Macfarlane, Clément Bonnet, Herke van Hoof, Levi H. S. Lelis

A central challenge in program induction has long been the trade-off between symbolic and neural approaches. Symbolic methods offer compositional generalisation and data efficiency, yet their scalability is constrained by formalisms such as domain-specific languages (DSLs), which are labour-intensive to create and may not transfer to new domains. In contrast, neural networks flexibly learn from data but tend to generalise poorly in compositional and out-of-distribution settings. We bridge this divide with an instance of a Latent Adaptation Network architecture named Neural Language Interpreter (NLI), which learns its own discrete, symbolic-like programming language end-to-end. NLI autonomously discovers a vocabulary of primitive operations and uses a novel differentiable neural executor to interpret variable-length sequences of these primitives. This allows NLI to represent programs that are not bound to a constant number of computation steps, enabling it to solve more complex problems than those seen during training. To make these discrete, compositional program structures amenable to gradient-based optimisation, we employ the Gumbel-Softmax relaxation, enabling the entire model to be trained end-to-end. Crucially, this same differentiability enables powerful test-time adaptation. At inference, NLI's program inductor provides an initial program guess. This guess is then refined via gradient descent through the neural executor, enabling efficient search for the neural program that best explains the given data. We demonstrate that NLI outperforms in-context learning, test-time training, and continuous latent program networks on tasks that require combinatorial generalisation and rapid adaptation to unseen tasks. Our results establish a new path toward models that combine the compositionality of discrete languages with the gradient-based search and end-to-end learning of neural networks.

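Summary: A central challenge in program induction has long been the trade-off between symbolic and neural approaches. Symbolic methods offer compositional generalisation and data efficiency, yet their scalability is constrained by formalisms such as domain-specific languages (DSLs), which are labour-intensive to create and may not transfer to new domains. Neural networks, by contrast, learn flexibly from data but tend to generalise poorly in compositional and out-of-distribution settings. We bridge this divide with an instance of a Latent Adaptation Network architecture named Neural Language Interpreter (NLI), which learns its own discrete, symbolic-like programming language end-to-end. NLI autonomously discovers a vocabulary of primitive operations and uses a novel differentiable neural executor to interpret variable-length sequences of these primitives. This lets NLI represent programs that are not bound to a fixed number of computation steps, so it can solve more complex problems than those seen during training. To make these discrete, compositional program structures amenable to gradient-based optimisation, we employ the Gumbel-Softmax relaxation, which allows the entire model to be trained end-to-end. Crucially, the same differentiability enables powerful test-time adaptation. At inference, NLI's program inductor provides an initial program guess, which is then refined by gradient descent through the neural executor, enabling an efficient search for the neural program that best explains the given data. We demonstrate that NLI outperforms in-context learning, test-time training, and continuous latent program networks on tasks that require combinatorial generalisation and rapid adaptation to unseen tasks. Our results establish a new path toward models that combine the compositionality of discrete languages with the gradient-based search and end-to-end learning of neural networks.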
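
The Gumbel-Softmax relaxation named above replaces a hard categorical choice (for example, which primitive to emit) with a softmax over logits perturbed by Gumbel noise and divided by a temperature. The sketch below shows only the forward sampling step in plain Python, under the standard formulation. It is not the NLI code, and the function name is an assumption. In an automatic-differentiation framework the same expression would be differentiable with respect to the logits, which is what makes the relaxation useful for training.

```python
import math
import random
from typing import List, Optional


def gumbel_softmax(
    logits: List[float],
    temperature: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Draw one relaxed (soft) one-hot sample over the given logits."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()

    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) for U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        noisy.append((logit - math.log(-math.log(u))) / temperature)

    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]
```

Lower temperatures push the sample toward a one-hot vector, while higher temperatures give smoother, more uniform samples.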

AI scientists produce results without reasoning scientifically

2604.18805v1 by Martiño Ríos-García, Nawaf Alampara, Chandan Gupta, Indrajeet Mandal, Sajid Mannan, Ali Asghar Aghajani, N. M. Anoop Krishnan, Kevin Maik Jablonka

Large language model (LLM)-based systems are increasingly deployed to conduct scientific research autonomously, yet whether their reasoning adheres to the epistemic norms that make scientific inquiry self-correcting is poorly understood. Here, we evaluate LLM-based scientific agents across eight domains, spanning workflow execution to hypothesis-driven inquiry, through more than 25,000 agent runs and two complementary lenses: (i) a systematic performance analysis that decomposes the contributions of the base model and the agent scaffold, and (ii) a behavioral analysis of the epistemological structure of agent reasoning. We observe that the base model is the primary determinant of both performance and behavior, accounting for 41.4% of explained variance versus 1.5% for the scaffold. Across all configurations, evidence is ignored in 68% of traces, refutation-driven belief revision occurs in 26%, and convergent multi-test evidence is rare. The same reasoning patterns appear whether the agent executes a computational workflow or conducts hypothesis-driven inquiry. These patterns persist even when agents receive near-complete successful reasoning trajectories as context, and the resulting unreliability compounds across repeated trials in epistemically demanding domains. Thus, current LLM-based agents execute scientific workflows but do not exhibit the epistemic patterns that characterize scientific reasoning. Outcome-based evaluation cannot detect these failures, and scaffold engineering alone cannot repair them. Until reasoning itself becomes a training target, the scientific knowledge produced by such agents cannot be justified by the process that generated it.

Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling

2604.18753v1 by Andrew Wang, Ellie Pavlick, Ritambhara Singh

An active challenge in developing multimodal machine learning (ML) models for healthcare is handling missing modalities during training and deployment. As clinical datasets are inherently temporal and sparse in terms of modality presence, capturing the underlying predictive signal via diagnostic multimodal ML models while retaining model explainability remains an ongoing challenge. In this work, we address this by re-framing clinical diagnosis as an autoregressive sequence modeling task, utilizing causal decoders from large language models (LLMs) to model a patient's multimodal trajectory. We first introduce a missingness-aware contrastive pre-training objective that integrates multiple modalities in datasets with missingness in a shared latent space. We then show that autoregressive sequence modeling with transformer-based architectures outperforms baselines on the MIMIC-IV and eICU fine-tuning benchmarks. Finally, we use interpretability techniques to move beyond performance boosts and find that across various patient stays, removing modalities leads to divergent behavior that our contrastive pre-training mitigates. By abstracting clinical diagnosis as sequence modeling and interpreting patient stay trajectories, we develop a framework to profile and handle missing modalities while addressing the canonical desideratum of safe, transparent clinical AI.

On the Importance and Evaluation of Narrativity in Natural Language AI Explanations

2604.18311v1 by Mateusz Cedro, David Martens

Explainable AI (XAI) aims to make the behaviour of machine learning models interpretable, yet many explanation methods remain difficult to understand. The integration of Natural Language Generation into XAI aims to deliver explanations in textual form, making them more accessible to practitioners. Current approaches, however, largely yield static lists of feature importances. Although such explanations indicate what influences the prediction, they do not explain why the prediction occurs. In this study, we draw on insights from social sciences and linguistics, and argue that XAI explanations should be presented in the form of narratives. Narrative explanations support human understanding through four defining properties: continuous structure, cause-effect mechanisms, linguistic fluency, and lexical diversity. We show that standard Natural Language Processing (NLP) metrics based solely on token probability or word frequency fail to capture these properties and can be matched or exceeded by tautological text that conveys no explanatory content. To address this issue, we propose seven automatic metrics that quantify the narrative quality of explanations along the four identified dimensions. We benchmark current state-of-the-art explanation generation methods on six datasets and show that the proposed metrics separate descriptive from narrative explanations more reliably than standard NLP metrics. Finally, to further advance the field, we propose a set of problem-agnostic XAI Narrative generation rules for producing natural language XAI explanations, so that the resulting XAI Narratives exhibit stronger narrative properties and align with the findings from the linguistic and social science literature.

Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support

2604.18302v1 by Eranga Bandara, Asanga Gunaratna, Ross Gore, Anita H. Clayton, Christopher K. Rhea, Sachini Rajapakse, Isurunima Kularathna, Sachin Shetty, Ravi Mukkamala, Xueping Liang, Preston Samuel, Atmaram Yarlagadda

Privacy represents one of the most critical yet underaddressed barriers to AI adoption in mental healthcare -- particularly in high-sensitivity operational environments such as military, correctional, and remote healthcare settings, where the risk of patient data exposure can deter help-seeking behavior entirely. Existing AI-enabled psychiatric decision support systems predominantly rely on cloud-based inference pipelines, requiring sensitive patient data to leave the device and traverse external servers, creating unacceptable privacy and security risks in these contexts. In this paper, we propose a zero-egress, on-device AI platform for privacy-preserving psychiatric decision support, deployed as a cross-platform mobile application. The proposed system extends our prior work on fine-tuned LLM consortiums for psychiatric diagnosis standardization by fundamentally re-architecting the inference pipeline for fully local execution -- ensuring that no patient data is transmitted to, processed by, or stored on any external server at any stage. The platform integrates a consortium of three lightweight, fine-tuned, and quantized open-source LLMs -- Gemma, Phi-3.5-mini, and Qwen2 -- selected for their compact architectures and proven efficiency on resource-constrained mobile hardware. An on-device orchestration layer coordinates ensemble inference and consensus-based diagnostic reasoning, producing DSM-5-aligned assessments for conditions. The platform is designed to assist clinicians with differential diagnosis and evidence-linked symptom mapping, as well as to support patient-facing self-screening with appropriate clinical safeguards. Initial evaluation demonstrates that the proposed zero-egress deployment achieves diagnostic accuracy comparable to its server-side predecessor while sustaining real-time inference latency on commodity mobile hardware.

Rabies diagnosis in low-data settings: A comparative study on the impact of data augmentation and transfer learning

2604.19823v1 by Khalil Akremi, Mariem Handous, Zied Bouslama, Farah Bassalah, Maryem Jebali, Mariem Hanachi, Ines Abdeljaoued-Tej

Rabies remains a major public health concern across many African and Asian countries, where accurate diagnosis is critical for effective epidemiological surveillance. The gold standard diagnostic methods rely heavily on fluorescence microscopy, necessitating skilled laboratory personnel for the accurate interpretation of results. Such expertise is often scarce, particularly in regions with low annual sample volumes. This paper presents an automated, AI-driven diagnostic system designed to address these challenges. We developed a robust pipeline utilizing fluorescent image analysis through transfer learning with four deep learning architectures: EfficientNetB0, EfficientNetB2, VGG16, and Vision Transformer (ViTB16). Three distinct data augmentation strategies were evaluated to enhance model generalization on a dataset of 155 microscopic images (123 positive and 32 negative). Our results demonstrate that TrivialAugmentWide was the most effective augmentation technique, as it preserved critical fluorescent patterns while improving model robustness. The EfficientNetB0 model, utilizing Geometric & Color augmentation and selected through stratified 3-fold cross-validation, achieved optimal classification performance on cropped images. Despite constraints posed by class imbalance and a limited dataset size, this work confirms the viability of deep learning for automating rabies diagnosis. The proposed method enables fast and reliable detection with significant potential for further optimization. An online tool was deployed to facilitate practical access, establishing a framework for future medical imaging applications. This research underscores the potential of optimized deep learning models to transform rabies diagnostics and improve public health outcomes.

ExAI5G: A Logic-Based Explainable AI Framework for Intrusion Detection in 5G Networks

2604.18052v1 by Saeid Sheikhi, Panos Kostakos, Lauri Loven

Intrusion detection systems (IDSs) for 5G networks must handle complex, high-volume traffic. Although opaque "black-box" models can achieve high accuracy, their lack of transparency hinders trust and effective operational response. We propose ExAI5G, a framework that prioritizes interpretability by integrating a Transformer-based deep learning IDS with logic-based explainable AI (XAI) techniques. The framework uses Integrated Gradients to attribute feature importance and extracts a surrogate decision tree to derive logical rules. We introduce a novel evaluation methodology for LLM-generated explanations, using a powerful evaluator LLM to assess actionability and measuring their semantic similarity and faithfulness. On a 5G IoT intrusion dataset, our system achieves 99.9% accuracy and a 0.854 macro F1-score, demonstrating strong performance. More importantly, we extract 16 logical rules with 99.7% fidelity, making the model's reasoning transparent. The evaluation demonstrates that modern LLMs can generate explanations that are both faithful and actionable, indicating that it is possible to build a trustworthy and effective IDS without compromising performance for the sake of marginal gains from an opaque model.
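
The abstract above reports very high accuracy alongside a noticeably lower macro F1-score. A short sketch shows how the two metrics can diverge on imbalanced traffic. The labels and counts below are invented toy data, not results from the paper, and the helper names `accuracy` and `macro_f1` are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy data (an assumption): 990 benign flows, 10 attacks, half the attacks missed.
y_true = ["benign"] * 990 + ["attack"] * 10
y_pred = ["benign"] * 990 + ["attack"] * 5 + ["benign"] * 5

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

In this toy case accuracy stays above 99% because the majority class dominates, while the missed attacks pull the macro F1 down to roughly 0.83.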

First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows

2604.18038v1 by Sihao Xing, Zaur Gouliev

Large language models (LLMs) are increasingly used in clinical settings, raising concerns about racial bias in both generated medical text and clinical reasoning. Existing studies have identified bias in medical LLMs, but many focus on single models and give less attention to mitigation. This study uses the EU AI Act as a governance lens to evaluate five widely used LLMs across two tasks, namely synthetic patient-case generation and differential diagnosis ranking. Using race-stratified epidemiological distributions in the United States and expert differential diagnosis lists as benchmarks, we apply structured prompt templates and a two-part evaluation design to examine implicit and explicit racial bias. All models deviated from observed racial distributions in the synthetic case generation task, with GPT-4.1 showing the smallest overall deviation. In the differential diagnosis task, DeepSeek V3 produced the strongest overall results across the reported metrics. When embedded in an agentic workflow, DeepSeek V3 showed an improvement of 0.0348 in mean p-value, 0.1166 in median p-value, and 0.0949 in mean difference relative to the standalone model, although improvement was not uniform across every metric. These findings support multi-metric bias evaluation for AI systems used in medical settings and suggest that retrieval-based agentic workflows may reduce some forms of explicit bias in benchmarked diagnostic tasks. Detailed prompt templates, experimental datasets, and code pipelines are available on our GitHub.

How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers

2604.17935v1 by Xiao Wang

The key-value (KV) cache is the dominant memory bottleneck during Transformer inference, yet little is known theoretically about how aggressively it can be compressed before multi-step reasoning degrades. We study this through $k$-hop pointer chasing on $n$ tokens under a shared KV cache of size $s$, attention dimension $m$, $H$ heads, $p$-bit precision, and a locality-respecting cache controller (satisfied by all standard KV-compression methods). We give three results. (1) Product depth lower bound (conjectured). We conjecture that any such Transformer ($n \geq 4k$, $s \leq \sqrt{n}/4$) requires depth $L = \Omega(\lceil k/s \rceil \cdot \lceil \log_2 n/(Hmp) \rceil)$, and isolate the sole remaining gap as a probabilistic step on the joint distribution of cache trace and pointer chain. Unconditionally, we prove a matching upper bound $L = O(\min(k, \lceil k/s \rceil \log s) \cdot \log n/(mp))$ via windowed pointer doubling, and a max-bound $L = \Omega(\max(\lceil k/s \rceil, \log n/(Hmp)))$. Closing the conjecture amounts to upgrading max to product. (2) Bandwidth barrier. The product bound binds only when $Hmp \lesssim \log n$. Any lower bound provable via per-window distinguishability counting -- including reachability, bandwidth, and combinations -- cannot exceed $\lceil k/s \rceil$ once $Hmp \geq \log_2 n$. Breaking this requires lifting unconditional communication-complexity bounds for pointer chasing to Cache-Transformer depth. (3) Adaptive vs oblivious error scaling. Under random cache over $T = \lceil \log_2 k \rceil$ doubling stages, oblivious caches give $\Pr[\mathcal{E}] \leq (s/(n-T))^T + 2T^3/n$ (exponential in $T$), while adaptive locality-respecting caches achieve $\Pr[\mathcal{E}] = s/n$ exactly, independent of $T$. The $\Omega((n/s)^{T-1})$ separation explains why heavy-hitter eviction empirically dominates random eviction for multi-hop reasoning.
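
The $k$-hop pointer chasing task used in the abstract above can be made concrete with a small sketch. It shows the sequential definition next to plain pointer doubling, which reaches the same answer in about $\log_2 k$ composition rounds. This is the classical doubling trick, not the paper's windowed construction, and it ignores the cache constraint entirely; the function names are illustrative assumptions.

```python
def chase(pointers, start, k):
    """Follow pointers[i] -> next index for k sequential hops."""
    pos = start
    for _ in range(k):
        pos = pointers[pos]
    return pos

def chase_by_doubling(pointers, start, k):
    """Same result using repeated squaring of the pointer map."""
    jump = list(pointers)  # jump[i] holds the 2**t-hop successor of i, starting at t = 0
    pos = start
    while k > 0:
        if k & 1:
            pos = jump[pos]  # consume the current power-of-two hop
        # Compose the map with itself: new jump[i] = old jump[old jump[i]].
        jump = [jump[j] for j in jump]
        k >>= 1
    return pos

# Toy instance: 8 tokens, each pointing at another token.
pointers = [3, 0, 5, 6, 1, 7, 2, 4]
sequential = chase(pointers, start=0, k=5)
doubled = chase_by_doubling(pointers, start=0, k=5)
```

Both calls return the same index. The doubling version needs the full pointer map available at every round, which is the kind of access a size-limited cache restricts.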

AI Approach for MRI-only Full-Spine Vertebral Segmentation and 3D Reconstruction in Paediatric Scoliosis

2604.17846v1 by Nathasha Naranpanawa, Maree T. Izatt, Robert D. Labrom, Geoffrey N. Askin, J. Paige Little

MRI is preferred over CT in paediatric imaging because it avoids ionising radiation, but its use in spine deformity assessment is largely limited by the lack of automated, high-resolution 3D bony reconstruction, which continues to rely on CT. MRI-based 3D reconstruction remains impractical due to manual workflows and the scarcity of labelled full-spine datasets. This study introduces an AI framework that enables fully automated thoracolumbar spine (T1-L5) segmentation and 3D reconstruction from MRI alone. Historical low-dose CT scans from adolescent idiopathic scoliosis (AIS) patients were converted into MRI-like images using a GAN and combined with existing labelled thoracic MRI data to train a U-Net-based model. The resulting algorithm accurately generated continuous thoracolumbar 3D reconstructions, improved segmentation accuracy (88% Dice score), and reduced processing time from approximately 1 hour to under one minute, while preserving AIS-specific deformity features. This approach enables radiation-free 3D deformity assessment from MRI, supporting clinical evaluation, surgical planning, and navigation in paediatric spine care.
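
The Dice score quoted in the abstract above measures overlap between a predicted mask and a reference mask. A minimal sketch of the standard formula, $2|A \cap B| / (|A| + |B|)$, follows. The flat 0/1 lists stand in for voxel masks and are invented for illustration.

```python
def dice_score(pred, truth):
    """Dice coefficient between two equal-length binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks agree perfectly, so return 1.0 by convention.
    return 1.0 if size == 0 else 2.0 * overlap / size

# Toy masks (assumed data): 3 voxels each, 2 of them shared.
predicted = [1, 1, 1, 0, 0, 0]
reference = [0, 1, 1, 1, 0, 0]
score = dice_score(predicted, reference)
```

Here the masks share two of their three voxels each, giving a Dice score of 2 * 2 / (3 + 3), or about 0.67.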

Community-Led AI Integration for Wildfire Risk Assessment: A Participatory AI Literacy and Explainability Integration (PALEI) Framework in Los Angeles, CA

2604.17755v1 by Sanaz Sadat Hosseini, Mona Azarbayjani, Mohammad Pourhomayoun, Hamed Tabkhi

Climate-driven wildfires are intensifying, particularly in urban regions such as Southern California. Yet, traditional fire risk communication tools often fail to gain public trust due to inaccessible design, non-transparent outputs, and limited contextual relevance. These challenges are especially critical in high-risk communities, where trust depends on how clearly and locally information is presented. Neighborhoods such as Pacific Palisades, Pasadena, and Altadena in Los Angeles exemplify these conditions. This study introduces a community-led approach for integrating AI into wildfire risk assessment using the Participatory AI Literacy and Explainability Integration (PALEI) framework. PALEI emphasizes early literacy building, value alignment, and participatory evaluation before deploying predictive models, prioritizing clarity, accessibility, and mutual learning between developers and residents. Early engagement findings show strong acceptance of visual, context-specific risk communication, positive fairness perceptions, and clear adoption interest, alongside privacy and data security concerns that influence trust. Participants emphasized localized imagery, accessible explanations, neighborhood-specific mitigation guidance, and transparent communication of uncertainty. The outcome is a mobile application co-designed with users and stakeholders, enabling residents to scan visible property features and receive interpretable fire risk scores with tailored recommendations. By embedding local context into design, the tool becomes an everyday resource for risk awareness and preparedness. This study argues that user experience is central to ethical and effective AI deployment and provides a replicable, literacy-first pathway for applying the PALEI framework to climate-related hazards.

MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models

2604.17730v1 by Suhyun Lee, Palakorn Achananuparp, Neemesh Yadav, Ee-Peng Lim, Yang Deng

Large language models (LLMs) are increasingly explored as scalable tools for mental health counseling, yet evaluating their safety remains challenging due to the interactional and context-dependent nature of clinical harm. Existing evaluation frameworks predominantly assess isolated responses using coarse-grained taxonomies or static datasets, limiting their ability to diagnose how harms emerge and accumulate over multi-turn counseling interactions. In this work, we introduce R-MHSafe, a role-aware mental health safety taxonomy that characterizes clinically significant harm in terms of the interactional roles an AI counselor adopts, including perpetrator, instigator, facilitator, or enabler, combined with clinically grounded harm categories. Then, we propose MHSafeEval, a closed-loop, agent-based evaluation framework that formulates safety assessment as trajectory-level discovery of harm through adversarial multi-turn interactions, guided by role-aware modeling. Using R-MHSafe and MHSafeEval, we conduct a large-scale evaluation across state-of-the-art LLMs. Our results reveal substantial role-dependent and cumulative safety failures that are systematically missed by existing static benchmarks, and show that our framework significantly improves failure-mode coverage and diagnostic granularity.

Semantic Entanglement in Vector-Based Retrieval: A Formal Framework and Context-Conditioned Disentanglement Pipeline for Agentic RAG Systems

2604.17677v1 by Nick Loghmani

Retrieval-Augmented Generation (RAG) systems depend on the geometric properties of vector representations to retrieve contextually appropriate evidence. When source documents interleave multiple topics within contiguous text, standard vectorization produces embedding spaces in which semantically distinct content occupies overlapping neighborhoods. We term this condition semantic entanglement. We formalize entanglement as a model-relative measure of cross-topic overlap in embedding space and define an Entanglement Index (EI) as a quantitative proxy. We argue that higher EI constrains attainable Top-K retrieval precision under cosine similarity retrieval. To address this, we introduce the Semantic Disentanglement Pipeline (SDP), a four-stage preprocessing framework that restructures documents prior to embedding. We further propose context-conditioned preprocessing, in which document structure is shaped by patterns of operational use, and a continuous feedback mechanism that adapts document structure based on agent performance. We evaluate SDP on a real-world enterprise healthcare knowledge base comprising over 2,000 documents across approximately 25 sub-domains. Top-K retrieval precision improves from approximately 32% under fixed-token chunking to approximately 82% under SDP, while mean EI decreases from 0.71 to 0.14. We do not claim that entanglement fully explains RAG failure, but that it captures a distinct preprocessing failure mode that downstream optimization cannot reliably correct once encoded into the vector space.
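
Top-K retrieval precision under cosine similarity, the headline metric in the abstract above, can be sketched as follows. The abstract does not give the exact evaluation protocol, so this uses the common precision-at-K definition. The chunk IDs, vectors, and relevance labels are invented toy values.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_precision(query, chunks, relevant_ids, k):
    """Share of the k most similar chunks that are labeled relevant."""
    ranked = sorted(chunks, key=lambda cid: cosine(query, chunks[cid]), reverse=True)
    return sum(1 for cid in ranked[:k] if cid in relevant_ids) / k

# Toy 2-D embeddings (assumed): "d" mixes both topics and sits between them.
chunks = {"a": [0.9, 0.1], "b": [0.8, 0.3], "c": [0.1, 0.9], "d": [0.7, 0.7]}
query = [1.0, 0.0]
p_at_2 = top_k_precision(query, chunks, relevant_ids={"a", "b"}, k=2)
p_at_3 = top_k_precision(query, chunks, relevant_ids={"a", "b"}, k=3)
```

The mixed-topic chunk `d` ranks third for this query, so precision drops from 1.0 at K=2 to 2/3 at K=3. This is a small-scale picture of overlapping neighborhoods lowering precision.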

On The Mathematics of the Natural Physics of Optimization

2604.17645v1 by I. M. Ross

A number of optimization algorithms have been inspired by the physics of Newtonian motion. Here, we ask the question: do algorithms themselves obey some "natural laws of motion," and can they be derived by an application of these laws? We explore this question by positing the theory that optimization algorithms may be considered as some manifestation of hidden algorithm primitives that obey certain universal non-Newtonian dynamics. This natural physics of optimization is developed by equating the terminal transversality conditions of an optimal control problem to the generalized Karush/John-Kuhn-Tucker conditions of an optimization problem. Through this equivalence formulation, the data functions of a given constrained optimization problem generate a natural vector field that permeates an entire hidden space with information on the optimality conditions. An "action-at-a-distance" operation via a Pontryagin-type minimum principle produces a local action to deliver a globalized result by way of a Hamilton-Jacobi inequality. An inverse-optimal algorithm is generated by performing control jumps that dissipate quantized "energy" defined by a search Lyapunov function. Illustrative applications of the proposed theory show that a large number of algorithms can be generated and explained in terms of the new mathematical physics of optimization.

STEP-PD: Stage-Aware and Explainable Parkinson's Disease Severity Classification Using Multimodal Clinical Assessments

2604.17611v1 by Md Mezbahul Islam, John Michael Templeton, Christian Poellabauer, Ananda Mohan Mondal

Parkinson's disease (PD) is a progressive disorder in which symptom burden and functional impairment evolve over time, making severity staging essential for clinical monitoring and treatment planning. However, many computational studies emphasize binary PD detection and do not fully use repeated follow-up clinical assessments for stage-aware prediction. This study proposes STEP-PD, a severity-aware machine learning framework to classify PD severity using clinically interpretable boundaries. It leverages all available visits from the Parkinson's Progression Markers Initiative (PPMI) and integrates routinely collected subjective questionnaires and objective clinician-assessed measures. Disease severity is defined using Hoehn and Yahr staging and grouped into three clinically meaningful categories: Healthy, Mild PD (stages 1-2), and Moderate-to-Severe PD (stages 3-5). Three binary classification problems and a three-class severity task were evaluated using stratified cross-validation with imbalance-aware training. To enhance interpretability, SHAP was used to provide global explanations and local patient-level waterfall explanations. Across all tasks, XGBoost achieved the strongest and most stable performance, with accuracies of 95.48% (Healthy vs. Mild), 99.44% (Healthy vs. Moderate-to-Severe), and 96.78% (Mild vs. Moderate-to-Severe), and 94.14% accuracy with 0.8775 Macro-F1 for three-class severity classification. Explainability results highlight a shift from early motor features to progression-related axial and balance impairments. These findings show that multimodal clinical assessments within the PPMI cohort can support accurate and interpretable visit-level PD severity stratification.
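
The three severity categories described in the abstract above can be written as a small mapping from Hoehn and Yahr stage to label. The stage ranges come from the abstract. Coding healthy participants as stage 0 and accepting only whole-number stages are assumptions made for this sketch, and `severity_group` is a hypothetical helper name.

```python
def severity_group(hy_stage):
    """Map an integer Hoehn and Yahr stage to one of three severity labels."""
    if hy_stage == 0:
        return "Healthy"  # assumption: healthy controls are coded as stage 0
    if 1 <= hy_stage <= 2:
        return "Mild PD"
    if 3 <= hy_stage <= 5:
        return "Moderate-to-Severe PD"
    raise ValueError(f"unsupported Hoehn and Yahr stage: {hy_stage!r}")

# The three pairwise binary tasks named in the abstract, built from the labels.
BINARY_TASKS = [
    ("Healthy", "Mild PD"),
    ("Healthy", "Moderate-to-Severe PD"),
    ("Mild PD", "Moderate-to-Severe PD"),
]

visit_labels = [severity_group(stage) for stage in [0, 1, 2, 3, 5]]
```

Each visit is labeled on its own, which matches the visit-level framing in the abstract. Half-stages from the modified scale would need an explicit rule before this sketch could handle them.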

CDSA-Net: Collaborative Decoupling of Vascular Structure and Background for High-Fidelity Coronary Digital Subtraction Angiography

2604.17208v1 by Si Li, Chen-Kai Hu, Zhenhuan Lyu, Yuanqing He

Digital subtraction angiography (DSA) in coronary imaging is fundamentally challenged by physiological motion, forcing reliance on raw angiograms cluttered with anatomical noise. Existing deep learning methods often produce images with two critical, clinically unacceptable flaws: persistent boundary artifacts and a loss of native tissue grayscale fidelity that undermines diagnostic confidence. We propose a novel framework, termed CDSA-Net, that for the first time explicitly decouples and jointly optimizes vascular structure preservation and realistic background restoration. CDSA-Net introduces two core innovations: (i) A hierarchical geometric prior guidance (HGPG) mechanism, embedded in our coronary structure extraction network (CSENet). It synergistically combines integrated geometric prior (IGP) with gated spatial modulation (GSM) and centerline-aware topology (CAT) loss supervision, ensuring structural continuity. (ii) An adaptive noise module (ANM) within our coronary background restoration network (CBResNet). Unlike standard restoration, ANM uniquely models the stochastic nature of clinical X-ray noise, bridging the domain gap to enable seamless background intensity estimation and the complete elimination of boundary artifacts. The final subtraction is obtained by removing the restored background from the raw angiogram. Quantitatively, CDSA-Net significantly outperforms state-of-the-art methods in vascular intensity correlation and perceptual quality. A 25.6% improvement in morphology assessment efficiency and a 42.9% gain in hemodynamic evaluation speed set a new benchmark for utility in interventional cardiology, while maintaining diagnostic results consistent with raw angiograms. The project code is available at https://github.com/DrThink-ai/CDSA-Net.

Persona-Based Requirements Engineering for Explainable Multi-Agent Educational Systems: A Scenario Simulator for Clinical Reasoning Training

2604.17186v1 by Weibing Zheng, Laurah Turner, Jess Kropczynski, Matthew Kelleher, Murat Ozer, Shane Halse

As Artificial Intelligence (AI) and Agentic AI become increasingly integrated across sectors such as education and healthcare, it is critical to ensure that Multi-Agent Education Systems (MAES) are explainable from the early stages of requirements engineering (RE) within the AI software development lifecycle. Explainability is essential to build trust, promote transparency, and enable effective human-AI collaboration. Although personas are well-established in human-computer interaction to represent users and capture their needs and behaviors, their role in RE for explainable MAES remains underexplored. This paper proposes a human-first, persona-driven, explainable MAES RE framework and demonstrates the framework through a MAES for clinical reasoning training. The framework integrates personas and user stories throughout the RE process to capture the needs, goals, and interactions of various stakeholders, including medical educators, medical students, an AI patient agent, and clinical agents (physical exam agent, diagnostic agent, clinical intervention agent, supervisor agent, evaluation agent). The goals, underlying models, and knowledge base shape agent interactions and inform the explainability requirements that guided the clinical reasoning training of medical students. A post-usage survey found that more than 78% of medical students reported that MAES improved their clinical reasoning skills. These findings demonstrate that persona-based RE effectively connects technical requirements with non-technical medical students through a human-centered approach, ensuring that explainable MAES are trustworthy, interpretable, and aligned with authentic clinical scenarios from the early stages of AI system engineering. The partial MAES for the clinical scenario simulator is open sourced at https://github.com/2sigmaEdTech/MAS/.

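Abstract: As artificial intelligence (AI) and agentic AI become increasingly integrated across sectors such as education and healthcare, it is critical to ensure that a Multi-Agent Education System (MAES) is explainable from the early requirements engineering (RE) stage of the AI software development lifecycle. Explainability is essential for building trust, promoting transparency, and enabling effective human-AI collaboration. Although personas are widely used in human-computer interaction to represent users and capture their needs and behaviors, their role in RE for explainable MAES remains underexplored. This paper proposes a human-first, persona-driven RE framework for explainable MAES and demonstrates it through a MAES for clinical reasoning training. The framework integrates personas and user stories throughout the RE process to capture the needs, goals, and interactions of various stakeholders, including medical educators, medical students, an AI patient agent, and clinical agents (physical exam agent, diagnostic agent, clinical intervention agent, supervisor agent, evaluation agent). The goals, underlying models, and knowledge base shape agent interactions and inform the explainability requirements that guided the clinical reasoning training of medical students. A post-usage survey found that more than 78% of medical students reported that MAES improved their clinical reasoning skills. These findings show that persona-based RE effectively connects technical requirements with non-technical medical students through a human-centered approach, ensuring that explainable MAES are trustworthy, interpretable, and aligned with authentic clinical scenarios from the early stages of AI system engineering. The partial MAES for the clinical scenario simulator is open sourced at https://github.com/2sigmaEdTech/MAS/.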

Abstain-R1: Calibrated Abstention and Post-Refusal Clarification via Verifiable RL

2604.17073v1 by Skylar Zhai, Jingcheng Liang, Dongyeop Kang

Reinforcement fine-tuning improves the reasoning ability of large language models, but it can also encourage them to answer unanswerable queries by guessing or hallucinating missing information. Existing abstention methods either train models to produce generic refusals or encourage follow-up clarifications without verifying whether those clarifications identify the key missing information. We study queries that are clear in meaning but cannot be reliably resolved from the given information, and argue that a reliable model should not only abstain, but also explain what is missing. We propose a clarification-aware RLVR reward that, while rewarding correct answers on answerable queries, jointly optimizes explicit abstention and semantically aligned post-refusal clarification on unanswerable queries. Using this reward, we train Abstain-R1, a 3B model that improves abstention and clarification on unanswerable queries while preserving strong performance on answerable ones. Experiments on Abstain-Test, Abstain-QA, and SelfAware show that Abstain-R1 substantially improves over its base model and achieves unanswerable-query behavior competitive with larger systems including DeepSeek-R1, suggesting that calibrated abstention and clarification can be learned through verifiable rewards rather than emerging from scale alone.

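Abstract: Reinforcement fine-tuning improves the reasoning ability of large language models, but it can also push models to answer unanswerable questions by guessing or hallucinating missing information. Existing abstention methods either train models to produce generic refusals or encourage follow-up clarifications without verifying whether those clarifications identify the key missing information. We study queries that are clear in meaning but cannot be reliably resolved from the given information, and argue that a reliable model should not only abstain but also explain what is missing. We propose a clarification-aware RLVR reward that rewards correct answers on answerable queries while jointly optimizing explicit abstention and semantically aligned post-refusal clarification on unanswerable queries. Using this reward, we train Abstain-R1, a 3B model that improves abstention and clarification on unanswerable queries while maintaining strong performance on answerable ones. Experiments on Abstain-Test, Abstain-QA, and SelfAware show that Abstain-R1 improves substantially over its base model and achieves unanswerable-query behavior competitive with larger systems, including DeepSeek-R1, which suggests that calibrated abstention and clarification can be learned through verifiable rewards rather than emerging from scale alone.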
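
The reward described in the Abstain-R1 abstract can be pictured with a small sketch. This is not the authors' reward: the function name, the 0.5/0.5 weights, and the idea of passing a precomputed `clarification_score` in [0, 1] (standing in for the semantic-alignment check) are all assumptions made for illustration.

```python
def clarification_aware_reward(answerable, abstained, answer_correct=False,
                               clarification_score=0.0,
                               abstain_weight=0.5, clarify_weight=0.5):
    """Hypothetical reward in [0, 1]; weights are illustrative assumptions."""
    if not 0.0 <= clarification_score <= 1.0:
        raise ValueError("clarification_score must lie in [0, 1]")
    if answerable:
        # Answerable query: reward only a correct answer given without abstaining.
        return 1.0 if (answer_correct and not abstained) else 0.0
    if not abstained:
        # Unanswerable query answered anyway (a guess or hallucination).
        return 0.0
    # Unanswerable query: credit for abstaining, plus credit for a
    # clarification that matches the missing information.
    return abstain_weight + clarify_weight * clarification_score
```

Under these assumed weights, a bare refusal on an unanswerable query earns 0.5, and a refusal that also names the missing information earns up to 1.0.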

Hybrid Quantum Neural Networks for Enhanced Breast Cancer Thermographic Classification: A Novel Quantum-Classical Integration Approach

2604.16953v1 by Riza Alaudin Syah, Irwan Alnarus Kautsar, Gunawan Witjaksono, Haza Nuzly bin Abdull Hamed

Breast cancer diagnosis through thermographic image analysis remains a critical challenge in medical AI, with classical deep learning approaches facing limitations in complex thermal pattern classification tasks. This paper presents a novel Hybrid Quantum Neural Network (HQNN) architecture that integrates quantum computing principles with classical convolutional neural networks for enhanced breast cancer classification. Our approach employs parameterized quantum circuits with multi-head attention mechanisms for quantum-aware feature encoding, coupled with classical convolutional layers for comprehensive pattern recognition. The quantum component utilizes a 4-qubit variational circuit with strongly entangling layers, while the classical component incorporates advanced attention mechanisms for feature fusion. Experimental validation on breast cancer thermographic data demonstrates substantial performance improvements over state-of-the-art classical architectures, with the quantum-enhanced approach exhibiting superior convergence dynamics and enhanced feature representation capabilities. Our findings provide evidence for quantum advantage in medical image classification through classical simulation, establishing a framework for quantum-classical hybrid systems in healthcare applications. The methodology addresses key challenges in quantum machine learning deployment while maintaining computational feasibility on near-term quantum devices.

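Abstract: Diagnosing breast cancer through thermographic image analysis remains a critical challenge in medical AI, with classical deep learning methods facing limitations in complex thermal pattern classification tasks. This paper presents a novel Hybrid Quantum Neural Network (HQNN) architecture that combines quantum computing principles with classical convolutional neural networks to enhance breast cancer classification. Our approach uses parameterized quantum circuits with multi-head attention mechanisms for quantum-aware feature encoding, coupled with classical convolutional layers for comprehensive pattern recognition. The quantum component uses a 4-qubit variational circuit with strongly entangling layers, while the classical component incorporates advanced attention mechanisms for feature fusion. Experimental validation on breast cancer thermographic data shows substantial performance improvements over state-of-the-art classical architectures, with the quantum-enhanced approach exhibiting superior convergence dynamics and enhanced feature representation capabilities. Our findings provide evidence for quantum advantage in medical image classification through classical simulation, establishing a framework for quantum-classical hybrid systems in healthcare applications. The methodology addresses key challenges in deploying quantum machine learning while maintaining computational feasibility on near-term quantum devices.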

LLMs can persuade only psychologically susceptible humans on societal issues, via trust in AI and emotional appeals, amid logical fallacies

2604.16935v1 by Alexis Carrillo, Salvatore Citraro, Ali Aghazhadeh Ardebili, Enrique Taietta, Giulio Rossetti, Emilio Ferrara, Giuseppe Alessandro Veltri, Massimo Stella

Little longitudinal evidence examines LLMs' persuasiveness and humanness along time-evolving psychological frameworks. We introduce Talk2AI, a longitudinal framework quantifying psycho-social, reasoning and affective dimensions of LLMs' persuasiveness about polarizing societal topics. In a four-wave longitudinal setup, Talk2AI's 770 participants engaged in structured conversations with one of four leading LLMs on topics like climate change, social media misinformation, and math anxiety. This produced 3,080 conversations and over 60,000 turns. After each wave, participants reported conviction in their initial topic stance, perceived opinion change, the LLM's perceived humanness, a self-donation to the topic and a textual explanation. Feedback time series showed longitudinal inertia in convictions, indicating some human anchoring to initial opinions even after repeated exposure to AI-generated arguments. Interestingly, NLP analyses revealed that both humans and LLMs relied on fallacious reasoning in one of every six conversational quips, countering the "LLMs as superior systems" stereotype behind cognitive surrender to LLMs. LLMs' perceived humanness was most learnable from sociodemographic, psychological and engagement features ($R^2=0.44$), followed by opinion change ($R^2=0.34$), conviction ($R^2=0.26$) and personal endowment ($R^2=0.24$). Crucially, explainable AI (XAI) indicated: (i) the presence of individuals more susceptible to LLM-based opinion changes; (ii) that psychological susceptibility to being convinced by LLMs consisted of having more trust in LLMs, being more agreeable and extraverted, and having a higher need for cognition. A multiverse approach with mixed-effects models confirmed the XAI results, alongside strong individual differences. Talk2AI provides a grounded framework and evidence for detecting how GenAI can influence human opinions via multiple psycho-social pathways in AI-human digital platforms.

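Abstract: Little longitudinal evidence examines the persuasiveness and humanness of large language models (LLMs) under time-evolving psychological frameworks. We introduce Talk2AI, a longitudinal framework that quantifies the psycho-social, reasoning, and affective dimensions of LLMs' persuasiveness on polarizing societal topics. In a four-way longitudinal setup, Talk2AI's 770 participants held structured conversations with one of four leading LLMs on topics such as climate change, social media misinformation, and math anxiety. This produced 3,080 conversations and more than 60,000 turns in total. After each wave, participants reported their conviction in their initial topic stance, perceived opinion change, the LLM's perceived humanness, a self-donation to the topic, and a textual explanation. Feedback time series showed longitudinal inertia in convictions, indicating that some people remain anchored to their initial opinions even after repeated exposure to AI-generated arguments. Interestingly, natural language processing (NLP) analyses showed that both humans and LLMs relied on fallacious reasoning in one of every six conversational turns, countering the "LLMs as superior systems" stereotype behind cognitive surrender to LLMs. LLMs' perceived humanness was most learnable from sociodemographic, psychological, and engagement features ($R^2=0.44$), followed by opinion change ($R^2=0.34$), conviction ($R^2=0.26$), and personal endowment ($R^2=0.24$). Crucially, explainable AI (XAI) indicated: (i) the presence of individuals more susceptible to LLM-based opinion change; (ii) that psychological susceptibility to LLM persuasion consisted of more trust in LLMs, greater agreeableness and extraversion, and a higher need for cognition. A multiverse approach with mixed-effects models confirmed the XAI results and showed strong individual differences. Talk2AI provides a grounded framework and evidence for detecting how generative AI can influence human opinions through multiple psycho-social pathways on AI-human digital platforms.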

The Reliance Negotiation Framework: A Dynamic Process Model of Student LLM Engagement in Academic Writing

2604.16772v1 by Shahin Hossain

Student engagement with large language models (LLMs) in academic writing is not a stable trait, an adoption decision, or a competency level; it is a continuously negotiated process that existing frameworks cannot adequately theorize. Typological models provide categories without mechanisms; technology acceptance models explain adoption but not post-adoption quality; AI literacy frameworks treat competency as a static predictor rather than a live input. None accounts for within-student variability across tasks, the developmental paradox whereby experience produces habituation rather than sophistication, or principled non-use as a form of ethical reasoning. This article introduces the Reliance Negotiation Framework (RNF), developed from a sequential explanatory mixed-methods study of 382 undergraduates at a public minority-serving institution in the United States (survey, N = 382; 14 semi-structured interviews; three qualitative survey strands; 1,435 coded instances). The RNF reconceptualizes LLM reliance as an ongoing negotiation among four concurrent inputs (perceived benefits, perceived risks, ethical commitments, and situational demands) with outputs that recursively modify subsequent decisions. A Two-Model Architecture accommodates the 13.0% of participants whose categorical ethical commitments foreclose negotiation entirely. The framework generates four falsifiable predictions with implications for AI literacy pedagogy, academic integrity policy, and equity-centered practice at minority-serving institutions.

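Abstract: Student engagement with large language models (LLMs) in academic writing is not a stable trait, an adoption decision, or a competency level; it is a continuously negotiated process that existing frameworks cannot adequately theorize. Typological models provide categories without mechanisms; technology acceptance models explain adoption but not post-adoption quality; AI literacy frameworks treat competency as a static predictor rather than a live input. None accounts for within-student variability across tasks, the developmental paradox whereby experience produces habituation rather than sophistication, or principled non-use as a form of ethical reasoning. This article introduces the Reliance Negotiation Framework (RNF), developed from a sequential explanatory mixed-methods study of 382 undergraduates at a public minority-serving institution in the United States (survey, N = 382; 14 semi-structured interviews; three qualitative survey strands; 1,435 coded instances). The RNF reconceptualizes LLM reliance as an ongoing negotiation among four concurrent inputs (perceived benefits, perceived risks, ethical commitments, and situational demands), with outputs that recursively modify subsequent decisions. A Two-Model Architecture accommodates the 13.0% of participants whose categorical ethical commitments rule out negotiation entirely. The framework generates four falsifiable predictions with implications for AI literacy pedagogy, academic integrity policy, and equity-centered practice at minority-serving institutions.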

Why Training-Free Token Reduction Collapses: The Inherent Instability of Pairwise Scoring Signals

2604.16745v1 by Yang Shanglin

Training-free token reduction methods for Vision Transformers (ToMe, ToFu, PiToMe, and MCTF) employ different scoring mechanisms, yet they share a closely matched cliff-like collapse at high compression. This paper explains why. We develop a diagnostic framework with two tools, ranking consistency $ρ_s$ and off-diagonal correlation $ρ_\text{off}$, that decomposes the collapse into (1) a signal-agnostic error amplifier inherent to layer-wise reduction, predicting convex Pareto curves and $r_{\text{crit}} \propto 1/L$; and (2) a shared reliance on pairwise similarity signals whose ranking consistency degrades from $ρ_s = 0.88$ to $0.27$ in deep layers. Pairwise rankings are inherently unstable ($O(N_p^2)$ joint perturbations) while unary signals enjoy greater stability ($O(N_p)$ perturbations, CLT). From three design principles derived from this diagnosis, we construct CATIS as a constructive validation: unary signals raise the trigger threshold, triage suppresses the gain. On ViT-Large at 63% FLOPs reduction, CATIS retains 96.9% of vanilla accuracy (81.0%) on ImageNet-1K where all baselines collapse to 43-65%.

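Abstract: Training-free token reduction methods for Vision Transformers (ToMe, ToFu, PiToMe, and MCTF) use different scoring mechanisms, yet they share a closely matched cliff-like collapse at high compression. This paper explains why. We develop a diagnostic framework with two tools, ranking consistency $ρ_s$ and off-diagonal correlation $ρ_\text{off}$, that decomposes the collapse into (1) a signal-agnostic error amplifier inherent to layer-wise reduction, which predicts convex Pareto curves and $r_{\text{crit}} \propto 1/L$; and (2) a shared reliance on pairwise similarity signals whose ranking consistency degrades from $ρ_s = 0.88$ to $0.27$ in deep layers. Pairwise rankings are inherently unstable ($O(N_p^2)$ joint perturbations), while unary signals enjoy greater stability ($O(N_p)$ perturbations, central limit theorem). From three design principles derived from this diagnosis, we construct CATIS as a constructive validation: unary signals raise the trigger threshold, and triage suppresses the gain. On ViT-Large at 63% FLOPs reduction, CATIS retains 96.9% of vanilla accuracy (81.0%) on ImageNet-1K, where all baselines collapse to 43-65%.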
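
To make "ranking consistency" concrete, the sketch below computes a Spearman rank correlation between two score vectors, for example pairwise similarity scores for the same token pairs before and after a perturbation. That $ρ_s$ in the paper is exactly Spearman's coefficient is an assumption here, and the function names are hypothetical. The closed-form formula used is valid only when there are no tied scores.

```python
def ranks(values):
    """Return the 0-based rank of each value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(scores_a, scores_b):
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long score lists of length >= 2")
    n = len(scores_a)
    squared_diffs = sum((ra - rb) ** 2
                        for ra, rb in zip(ranks(scores_a), ranks(scores_b)))
    return 1.0 - 6.0 * squared_diffs / (n * (n * n - 1))
```

A value near 1 means the two score vectors order the candidate pairs the same way; swapping the top two of four candidates already lowers the value to 0.8.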

CT Open: An Open-Access, Uncontaminated, Live Platform for the Open Challenge of Clinical Trial Outcome Prediction

2604.16742v1 by Jianyou Wang, Youze Zheng, Longtian Bao, Hanyuan Zhang, Qirui Zheng, Yuhan Chen, Yang Zhang, Matthew Feng, Maxim Khan, Aditya K. Sehgal, Christopher D. Rosin, Ramamohan Paturi, Umber Dube, Leon Bergen

Scientists have long sought to accurately predict outcomes of real-world events before they happen. Can AI systems do so more reliably? We study this question through clinical trial outcome prediction, a high-stakes open challenge even for domain experts. We introduce CT Open, an open-access, live platform that will run four challenges every year. Anyone can submit predictions for each challenge. CT Open evaluates those submissions on trials whose outcomes were not yet public at the time of submission but were made public afterwards. Determining whether a trial's outcome was public on the internet before a certain date is surprisingly difficult. Outcomes posted on official registries may lag behind by years, while the first mention may appear in obscure articles. To address this, we propose a novel, fully automated decontamination pipeline that uses iterative LLM-powered web search to identify the earliest mention of trial outcomes. We validate the pipeline's quality and accuracy against human experts' annotations. Since CT Open's pipeline ensures that every evaluated trial had no publicly reported outcome when the prediction was made, it allows participants to use any methodology and any data source. In this paper, we release a training set and two time-stamped test benchmarks, Winter 2025 and Summer 2025. We believe CT Open can serve as a central hub for advancing AI research on forecasting real-world outcomes before they occur, while also informing biomedical research and improving clinical trial design. The CT Open Platform is hosted at https://ct-open.net/

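Abstract: Scientists have long sought to accurately predict the outcomes of real-world events before they happen. Can AI systems do so more reliably? We study this question through clinical trial outcome prediction, a high-stakes open challenge even for domain experts. We introduce CT Open, an open-access, live platform that runs four challenges every year. Anyone can submit predictions for each challenge. CT Open evaluates submissions on trials whose outcomes were not yet public at the time of submission but were made public afterwards. Determining whether a trial's outcome was public on the internet before a certain date is surprisingly difficult. Outcomes posted on official registries may lag by years, while the first mention may appear in obscure articles. To address this, we propose a novel, fully automated decontamination pipeline that uses iterative LLM-powered web search to identify the earliest mention of trial outcomes. We validate the pipeline's quality and accuracy against human experts' annotations. Because CT Open's pipeline ensures that every evaluated trial had no publicly reported outcome when the prediction was made, participants can use any methodology and any data source. In this paper, we release a training set and two time-stamped test benchmarks, Winter 2025 and Summer 2025. We believe CT Open can serve as a central hub for advancing AI research on forecasting real-world outcomes, while also informing biomedical research and improving clinical trial design. The CT Open Platform is hosted at https://ct-open.net/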

When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis

2604.16736v1 by Justice Owusu Agyemang, Michael Agyare, Miriam Kobbinah, Nathaniel Agbugblah, Prosper Addo

LLM-powered coding agents suffer from a poorly understood failure mode we term output stalling: the agent silently produces empty responses when attempting to generate large, format-heavy documents. We present a theoretical framework that explains and prevents this failure through three contributions. (1) We introduce Output Generation Capacity (OGC), a formal measure of an agent's effective ability to produce output given its current context state - distinct from and empirically smaller than the raw context window. (2) We prove a Format-Cost Separation Theorem showing that deferred template rendering is always at least as token-efficient as direct generation for any format with overhead multiplier $μ_f > 1$, and derive tight bounds on the savings. (3) We formalize Adaptive Strategy Selection, a decision framework that maps the ratio of estimated output cost to available OGC into an optimal generation strategy (direct, chunked, or deferred). We validate the theory through controlled experiments across three models (Claude 3.5 Sonnet, GPT-4o, Llama 3.1 70B), four document types, and an ablation study isolating each component's contribution. Deferred rendering reduces LLM generation tokens by 48-72% across all conditions and eliminates output stalling entirely. We instantiate the framework as GEN-PILOT, an open-source MCP server, demonstrating that the theory translates directly into a practical tool.

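Abstract: LLM-powered coding agents suffer from a poorly understood failure mode that we call output stalling: when attempting to generate large, format-heavy documents, the agent silently produces empty responses. We present a theoretical framework that explains and prevents this failure through three contributions. (1) We introduce Output Generation Capacity (OGC), a formal measure of an agent's effective ability to produce output given its current context state, which is distinct from and empirically smaller than the raw context window. (2) We prove a Format-Cost Separation Theorem showing that deferred template rendering is at least as token-efficient as direct generation for any format with overhead multiplier $μ_f > 1$, and derive tight bounds on the savings. (3) We formalize Adaptive Strategy Selection, a decision framework that maps the ratio of estimated output cost to available OGC into an optimal generation strategy (direct, chunked, or deferred). We validate the theory through controlled experiments across three models (Claude 3.5 Sonnet, GPT-4o, Llama 3.1 70B), four document types, and an ablation study isolating each component's contribution. Deferred rendering reduces LLM generation tokens by 48-72% across all conditions and eliminates output stalling entirely. We instantiate the framework as GEN-PILOT, an open-source MCP server, showing that the theory translates directly into a practical tool.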
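
The Adaptive Strategy Selection idea (mapping the ratio of estimated output cost to available OGC onto direct, chunked, or deferred generation) can be sketched as a threshold rule. The function name and the two cut-off ratios below are hypothetical; the abstract does not state the thresholds GEN-PILOT uses.

```python
def select_strategy(estimated_output_tokens, available_ogc_tokens,
                    direct_max_ratio=0.5, chunked_max_ratio=1.0):
    """Pick a generation strategy from the cost-to-capacity ratio.

    The two ratio thresholds are illustrative assumptions, not values
    taken from the paper.
    """
    if estimated_output_tokens < 0:
        raise ValueError("estimated_output_tokens must be non-negative")
    if available_ogc_tokens <= 0:
        # No usable capacity: leave formatting to a template renderer.
        return "deferred"
    ratio = estimated_output_tokens / available_ogc_tokens
    if ratio <= direct_max_ratio:
        return "direct"
    if ratio <= chunked_max_ratio:
        return "chunked"
    return "deferred"
```

With these assumed thresholds, a 2,000-token document against 8,000 tokens of capacity is generated directly, while a 12,000-token document against the same capacity is deferred.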

Agentic Large Language Models for Training-Free Neuro-Radiological Image Analysis

2604.16729v1 by Ayhan Can Erdur, Daniel Scholz, Jiazhen Pan, Benedikt Wiestler, Daniel Rueckert, Jan C. Peeken

State-of-the-art large language models (LLMs) show high performance in general visual question answering. However, a fundamental limitation remains: current architectures lack the native 3D spatial reasoning required for direct analysis of volumetric medical imaging, such as CT or MRI. Emerging agentic AI offers a new solution, eliminating the need for intrinsic 3D processing by enabling LLMs to orchestrate and leverage specialized external tools. Yet, the feasibility of such agentic frameworks in complex, multi-step radiological workflows remains underexplored. In this work, we present a training-free agentic pipeline for automated brain MRI analysis. Validating our methodology on several LLMs (GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5) with off-the-shelf domain-specific tools, we show that our system autonomously executes complex end-to-end workflows, including preprocessing (skull stripping, registration), pathology segmentation (glioma, meningioma, metastases), and volumetric analysis. We evaluate our framework across increasingly complex radiological tasks, from single-scan segmentation and volumetric reporting to longitudinal response assessment requiring multi-timepoint comparisons. We analyze the impact of architectural design by comparing single-agent models against multi-agent "domain-expert" collaborations. Finally, to support rigorous evaluation of future agentic systems, we introduce and release a benchmark dataset of image-prompt-answer tuples derived from public BraTS data. Our results demonstrate that agentic AI can solve highly complex neuro-radiological image analysis tasks through tool use without the need for training or fine-tuning.

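Abstract: State-of-the-art large language models (LLMs) perform well in general visual question answering. However, a fundamental limitation remains: current architectures lack the native 3D spatial reasoning needed to directly analyze volumetric medical imaging such as CT or MRI. Emerging agentic AI offers a new solution, removing the need for intrinsic 3D processing by enabling LLMs to orchestrate and use specialized external tools. Yet the feasibility of such agentic frameworks in complex, multi-step radiological workflows remains underexplored. In this work, we present a training-free agentic pipeline for automated brain MRI analysis. We validate our method on several LLMs (GPT-5.1, Gemini 3 Pro, Claude Sonnet 4.5) with off-the-shelf domain-specific tools; our system autonomously executes complex end-to-end workflows, including preprocessing (skull stripping, registration), pathology segmentation (glioma, meningioma, metastases), and volumetric analysis. We evaluate our framework on increasingly complex radiological tasks, from single-scan segmentation and volumetric reporting to longitudinal response assessment requiring multi-timepoint comparisons. We analyze the impact of architectural design by comparing single-agent models against multi-agent "domain-expert" collaborations. Finally, to support rigorous evaluation of future agentic systems, we introduce and release a benchmark dataset of image-prompt-answer tuples derived from public BraTS data. Our results show that agentic AI can solve neuro-radiological image analysis tasks through tool use, without training or fine-tuning.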

The Query Channel: Information-Theoretic Limits of Masking-Based Explanations

2604.16689v1 by Erciyes Karakaya, Ozgur Ercetin

Masking-based post-hoc explanation methods, such as KernelSHAP and LIME, estimate local feature importance by querying a black-box model under randomized perturbations. This paper formulates this procedure as communication over a query channel, where the latent explanation acts as a message and each masked evaluation is a channel use. Within this framework, the complexity of the explanation is captured by the entropy of the hypothesis class, while the query interface supplies information at a rate determined by an identification capacity per query. We derive a strong converse showing that, if the explanation rate exceeds this capacity, the probability of error in exact recovery necessarily converges to one for any sequence of explainers and decoders. We also prove an achievability result establishing that a sparse maximum-likelihood decoder attains reliable recovery when the rate lies below capacity. A Monte Carlo estimator of mutual information yields a non-asymptotic query benchmark that we use to compare optimal decoding with Lasso- and OLS-based procedures that mirror LIME and KernelSHAP. Experiments reveal a range of query budgets where information theory permits reliable explanations but standard convex surrogates still fail. Finally, we interpret super-pixel resolution and tokenization for neural language models as a source-coding choice that sets the entropy of the explanation, and show how Gaussian noise and nonlinear curvature degrade the query channel, induce waterfall and error-floor behavior, and render high-resolution explanations unattainable.

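Abstract: Masking-based post-hoc explanation methods, such as KernelSHAP and LIME, estimate local feature importance by querying a black-box model under randomized perturbations. This paper formulates the procedure as communication over a query channel, where the latent explanation is the message and each masked evaluation is one channel use. Within this framework, the complexity of the explanation is captured by the entropy of the hypothesis class, while the query interface supplies information at a rate determined by an identification capacity per query. We derive a strong converse showing that, if the explanation rate exceeds this capacity, the probability of error in exact recovery necessarily converges to one for any sequence of explainers and decoders. We also prove an achievability result establishing that a sparse maximum-likelihood decoder achieves reliable recovery when the rate is below capacity. A Monte Carlo estimator of mutual information yields a non-asymptotic query benchmark, which we use to compare optimal decoding with Lasso- and OLS-based procedures that mirror LIME and KernelSHAP. Experiments reveal a range of query budgets in which information theory permits reliable explanations but standard convex surrogates still fail. Finally, we interpret super-pixel resolution and tokenization for neural language models as a source-coding choice that sets the entropy of the explanation, and show how Gaussian noise and nonlinear curvature degrade the query channel, induce waterfall and error-floor behavior, and make high-resolution explanations unattainable.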
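
The query procedure that the abstract treats as a channel can be illustrated with a minimal sketch: evaluate a black box on binary masks and fit an unweighted ordinary-least-squares surrogate. This is not KernelSHAP (which weights masks with the Shapley kernel) or LIME (which samples and weights by locality); it enumerates every mask for determinism, and `toy_model` is a made-up additive function used only as a stand-in for a real model.

```python
from itertools import product


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(n):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def ols_mask_attribution(black_box, n_features):
    """Query black_box on all 2^n masks and fit intercept + per-feature weights."""
    rows, targets = [], []
    for mask in product([0, 1], repeat=n_features):
        rows.append([1.0] + [float(bit) for bit in mask])  # leading 1 = intercept
        targets.append(black_box(mask))                    # one "channel use"
    size = n_features + 1
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(size)]
           for i in range(size)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(size)]
    coefficients = solve_linear_system(xtx, xty)
    return coefficients[0], coefficients[1:]


def toy_model(mask):
    """Hypothetical additive black box with known feature effects."""
    return 1.0 + 2.0 * mask[0] - 1.0 * mask[1] + 0.5 * mask[2]
```

For an additive black box such as `toy_model`, the fitted weights recover the true effects up to floating-point error. The abstract's point is that with noise, nonlinearity, or a limited query budget, such surrogates can fail even where reliable recovery is information-theoretically possible.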

Using Large Language Models and Knowledge Graphs to Improve the Interpretability of Machine Learning Models in Manufacturing

2604.16280v1 by Thomas Bayer, Alexander Lohr, Sarah Weiß, Bernd Michelberger, Wolfram Höpken

Explaining Machine Learning (ML) results in a transparent and user-friendly manner remains a challenging task of Explainable Artificial Intelligence (XAI). In this paper, we present a method to enhance the interpretability of ML models by using a Knowledge Graph (KG). We store domain-specific data along with ML results and their corresponding explanations, establishing a structured connection between domain knowledge and ML insights. To make these insights accessible to users, we designed a selective retrieval method in which relevant triplets are extracted from the KG and processed by a Large Language Model (LLM) to generate user-friendly explanations of ML results. We evaluated our method in a manufacturing environment using the XAI Question Bank. Beyond standard questions, we introduce more complex, tailored questions that highlight the strengths of our approach. We evaluated 33 questions, analyzing responses using quantitative metrics such as accuracy and consistency, as well as qualitative ones such as clarity and usefulness. Our contribution is both theoretical and practical: from a theoretical perspective, we present a novel approach for effectively enabling LLMs to dynamically access a KG in order to improve the explainability of ML results. From a practical perspective, we provide empirical evidence showing that such explanations can be successfully applied in real-world manufacturing environments, supporting better decision-making in manufacturing processes.

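Abstract: Explaining machine learning (ML) results in a transparent and user-friendly manner remains a challenging task in Explainable Artificial Intelligence (XAI). In this paper, we present a method that uses a Knowledge Graph (KG) to enhance the interpretability of ML models. We store domain-specific data together with ML results and their corresponding explanations, establishing a structured connection between domain knowledge and ML insights. To make these insights accessible to users, we designed a selective retrieval method in which relevant triplets are extracted from the KG and processed by a Large Language Model (LLM) to generate user-friendly explanations of ML results. We evaluated our method in a manufacturing environment using the XAI Question Bank. Beyond standard questions, we introduce more complex, tailored questions that highlight the strengths of our approach. We evaluated 33 questions, analyzing responses with quantitative metrics such as accuracy and consistency and qualitative ones such as clarity and usefulness. Our contribution is both theoretical and practical: from a theoretical perspective, we present a novel approach that effectively enables LLMs to dynamically access a KG to improve the explainability of ML results. From a practical perspective, we provide empirical evidence that such explanations can be applied successfully in real-world manufacturing environments, supporting better decision-making in manufacturing processes.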

MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation

2604.16175v1 by Yi Lin, Yihao Ding, Yonghui Wu, Yifan Peng

Automated 3D radiology report generation often suffers from clinical hallucinations and a lack of the iterative verification found in human practice. While recent Vision-Language Models (VLMs) have advanced the field, they typically operate as monolithic "black-box" systems without the collaborative oversight characteristic of clinical workflows. To address these challenges, we propose MARCH (Multi-Agent Radiology Clinical Hierarchy), a multi-agent framework that emulates the professional hierarchy of radiology departments and assigns specialized roles to distinct agents. MARCH utilizes a Resident Agent for initial drafting with multi-scale CT feature extraction, multiple Fellow Agents for retrieval-augmented revision, and an Attending Agent that orchestrates an iterative, stance-based consensus discourse to resolve diagnostic discrepancies. On the RadGenome-ChestCT dataset, MARCH significantly outperforms state-of-the-art baselines in both clinical fidelity and linguistic accuracy. Our work demonstrates that modeling human-like organizational structures enhances the reliability of AI in high-stakes medical domains.

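Abstract: Automated 3D radiology report generation often suffers from clinical hallucinations and lacks the iterative verification found in human practice. Although recent Vision-Language Models (VLMs) have advanced the field, they typically operate as monolithic "black-box" systems without the collaborative oversight characteristic of clinical workflows. To address these challenges, we propose MARCH (Multi-Agent Radiology Clinical Hierarchy), a multi-agent framework that emulates the professional hierarchy of radiology departments and assigns specialized roles to distinct agents. MARCH uses a Resident Agent for initial drafting with multi-scale CT feature extraction, multiple Fellow Agents for retrieval-augmented revision, and an Attending Agent that orchestrates an iterative, stance-based consensus discourse to resolve diagnostic discrepancies. On the RadGenome-ChestCT dataset, MARCH significantly outperforms state-of-the-art baselines in both clinical fidelity and linguistic accuracy. Our work shows that modeling human-like organizational structures can improve the reliability of AI in high-stakes medical domains.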

Can LLMs Understand the Impact of Trauma? Costs and Benefits of LLMs Coding the Interviews of Firearm Violence Survivors

2604.16132v1 by Jessica H. Zhu, Shayla Stringfield, Vahe Zaprosyan, Michael Wagner, Michel Cukier, Joseph B. Richardson

Firearm violence is a pressing public health issue, yet research into survivors' lived experiences remains underfunded and difficult to scale. Qualitative research, including in-depth interviews, is a valuable tool for understanding the personal and societal consequences of community firearm violence and designing effective interventions. However, manually analyzing these narratives through thematic analysis and inductive coding is time-consuming and labor-intensive. Recent advancements in large language models (LLMs) have opened the door to automating this process, though concerns remain about whether these models can accurately and ethically capture the experiences of vulnerable populations. In this study, we assess the use of open-source LLMs to inductively code interviews with 21 Black men who have survived community firearm violence. Our results demonstrate that while some configurations of LLMs can identify important codes, overall relevance remains low and is highly sensitive to data processing. Furthermore, LLM guardrails lead to substantial narrative erasure. These findings highlight both the potential and limitations of LLM-assisted qualitative coding and underscore the ethical challenges of applying AI in research involving marginalized communities.

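Abstract: Firearm violence is a pressing public health issue, yet research into survivors' lived experiences remains underfunded and difficult to scale. Qualitative research, including in-depth interviews, is a valuable tool for understanding the personal and societal consequences of community firearm violence and for designing effective interventions. However, manually analyzing these narratives through thematic analysis and inductive coding is time-consuming and labor-intensive. Recent advances in large language models (LLMs) have opened the door to automating this process, but concerns remain about whether these models can accurately and ethically capture the experiences of vulnerable populations. In this study, we assess the use of open-source LLMs to inductively code interviews with 21 Black men who have survived community firearm violence. Our results show that although some LLM configurations can identify important codes, overall relevance remains low and is highly sensitive to data processing. Furthermore, LLM guardrails lead to substantial narrative erasure. These findings highlight both the potential and the limitations of LLM-assisted qualitative coding and underscore the ethical challenges of applying AI in research involving marginalized communities.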

Dual-Modal Lung Cancer AI: Interpretable Radiology and Microscopy with Clinical Risk Integration

2604.16104v1 by Baramee Sukumal, Aueaphum Aueawatthanaphisut

Lung cancer remains one of the leading causes of cancer-related mortality worldwide. Conventional computed tomography (CT) imaging, while essential for detection and staging, has limitations in distinguishing benign from malignant lesions and providing interpretable diagnostic insights. To address this challenge, this study proposes a dual-modal artificial intelligence framework that integrates CT radiology with hematoxylin and eosin (H&E) histopathology for lung cancer diagnosis and subtype classification. The system employs convolutional neural networks to extract radiologic and histopathologic features and incorporates clinical metadata to improve robustness. Predictions from both modalities are fused using a weighted decision-level integration mechanism to classify adenocarcinoma, squamous cell carcinoma, large cell carcinoma, small cell lung cancer, and normal tissue. Explainable AI techniques including Grad-CAM, Grad-CAM++, Integrated Gradients, Occlusion, Saliency Maps, and SmoothGrad are applied to provide visual interpretability. Experimental results show strong performance with accuracy up to 0.87, AUROC above 0.97, and macro F1-score of 0.88. Grad-CAM++ achieved the highest faithfulness and localization accuracy, demonstrating strong correspondence with expert-annotated tumor regions. These results indicate that multimodal fusion of radiology and histopathology can improve diagnostic performance while maintaining model transparency, suggesting potential for future clinical decision support systems in precision oncology.

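Abstract: Lung cancer remains one of the leading causes of cancer-related mortality worldwide. Conventional computed tomography (CT) imaging, while essential for detection and staging, has limitations in distinguishing benign from malignant lesions and in providing interpretable diagnostic insights. To address this challenge, this study proposes a dual-modal artificial intelligence framework that integrates CT radiology with hematoxylin and eosin (H&E) histopathology for lung cancer diagnosis and subtype classification. The system uses convolutional neural networks to extract radiologic and histopathologic features and incorporates clinical metadata to improve robustness. Predictions from both modalities are fused through a weighted decision-level integration mechanism to classify adenocarcinoma, squamous cell carcinoma, large cell carcinoma, small cell lung cancer, and normal tissue. Explainable AI techniques, including Grad-CAM, Grad-CAM++, Integrated Gradients, Occlusion, Saliency Maps, and SmoothGrad, are applied to provide visual interpretability. Experimental results show strong performance, with accuracy up to 0.87, AUROC above 0.97, and a macro F1-score of 0.88. Grad-CAM++ achieved the highest faithfulness and localization accuracy, showing strong correspondence with expert-annotated tumor regions. These results indicate that multimodal fusion of radiology and histopathology can improve diagnostic performance while maintaining model transparency, suggesting potential for future clinical decision support systems in precision oncology.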
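
Weighted decision-level fusion, as named in the abstract, can be sketched as a convex combination of the two modalities' class probabilities. The class order, the function names, and the `ct_weight` parameter are assumptions for illustration; the abstract does not give the paper's actual weights or fusion rule details.

```python
CLASSES = [
    "adenocarcinoma",
    "squamous cell carcinoma",
    "large cell carcinoma",
    "small cell lung cancer",
    "normal tissue",
]


def fuse_predictions(ct_probs, histo_probs, ct_weight=0.5):
    """Blend per-class probabilities from the CT and H&E models.

    ct_weight is a hypothetical tuning parameter in [0, 1]; the
    histopathology model receives the remaining weight.
    """
    if len(ct_probs) != len(histo_probs):
        raise ValueError("both models must score the same classes")
    if not 0.0 <= ct_weight <= 1.0:
        raise ValueError("ct_weight must lie in [0, 1]")
    return [ct_weight * c + (1.0 - ct_weight) * h
            for c, h in zip(ct_probs, histo_probs)]


def predict_label(fused_probs):
    """Return the class name with the highest fused probability."""
    best = max(range(len(fused_probs)), key=fused_probs.__getitem__)
    return CLASSES[best]
```

When the two modalities disagree, the weight decides the outcome: a CT model favoring adenocarcinoma and a histopathology model favoring squamous cell carcinoma resolve to whichever modality carries more weight.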

Towards Intrinsic Interpretability of Large Language Models: A Survey of Design Principles and Architectures

2604.16042v2 by Yutong Gao, Qinglin Meng, Yuan Zhou, Liangming Pan

While Large Language Models (LLMs) have achieved strong performance across many NLP tasks, their opaque internal mechanisms hinder trustworthiness and safe deployment. Existing surveys in explainable AI largely focus on post-hoc explanation methods that interpret trained models through external approximations. In contrast, intrinsic interpretability, which builds transparency directly into model architectures and computations, has recently emerged as a promising alternative. This paper presents a systematic review of the recent advances in intrinsic interpretability for LLMs, categorizing existing approaches into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction. We further discuss open challenges and outline future research directions in this emerging field. The paper list is available at: https://github.com/PKU-PILLAR-Group/Survey-Intrinsic-Interpretability-of-LLMs.

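Abstract: Although Large Language Models (LLMs) have achieved strong performance across many natural language processing tasks, their opaque internal mechanisms hinder trustworthiness and safe deployment. Existing surveys in explainable AI focus mainly on post-hoc explanation methods, which interpret trained models through external approximations. In contrast, intrinsic interpretability, which builds transparency directly into model architectures and computations, has recently emerged as a promising alternative. This paper systematically reviews recent advances in intrinsic interpretability for LLMs, grouping existing approaches into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction. We further discuss open challenges and outline future research directions in this emerging field. The paper list is available at: https://github.com/PKU-PILLAR-Group/Survey-Intrinsic-Interpretability-of-LLMs.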

Evaluating Temporal and Structural Anomaly Detection Paradigms for DDoS Traffic

2604.16575v1 by Yasmin Souza Lima, Rodrigo Moreira, Larissa F. Rodrigues Moreira, Tereza Cristina M. de B. Carvalho, Flávio de Oliveira Silva

Unsupervised anomaly detection is widely used to detect Distributed Denial-of-Service (DDoS) attacks in cloud-native 5G networks, yet most studies assume a fixed traffic representation, either temporal or structural, without validating which feature space best matches the data. We propose a lightweight decision framework that prioritizes temporal or structural features before training, using two diagnostics: lag-1 autocorrelation of an aggregated flow signal and PCA cumulative explained variance. When the probes are inconclusive, the framework reserves a hybrid option as a future fallback rather than an empirically validated branch. Experiments on two statistically distinct datasets with Isolation Forest, One-Class SVM, and KMeans show that structural features consistently match or outperform temporal ones, with the performance gap widening as temporal dependence weakens.
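
The two diagnostics named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the thresholds (`acf_threshold`, `var_threshold`, `max_components`), and the decision rule in `choose_representation` are all assumptions made for the example.

```python
import numpy as np

def lag1_autocorrelation(signal):
    """Lag-1 autocorrelation of a 1-D aggregated flow signal."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    if denom == 0.0:
        return 0.0  # constant signal: no temporal structure to measure
    return float(np.dot(x[:-1], x[1:]) / denom)

def pca_cumulative_variance(features):
    """Cumulative explained-variance ratio of the principal components."""
    X = np.asarray(features, dtype=float)
    X = X - X.mean(axis=0)
    # Squared singular values of the centered matrix are proportional
    # to the variance captured by each principal component.
    s = np.linalg.svd(X, compute_uv=False)
    var = s ** 2
    return np.cumsum(var) / var.sum()

def choose_representation(signal, features, acf_threshold=0.5,
                          var_threshold=0.9, max_components=3):
    """Hypothetical decision rule; every threshold here is an assumption."""
    temporal = lag1_autocorrelation(signal) >= acf_threshold
    cum = pca_cumulative_variance(features)
    k = min(max_components, len(cum))
    structural = cum[k - 1] >= var_threshold
    if temporal and not structural:
        return "temporal"
    if structural and not temporal:
        return "structural"
    return "hybrid"  # probes agree or both fail: treated as inconclusive
```

A strongly periodic signal paired with unstructured features would come out as "temporal" under these assumed thresholds, while an uncorrelated signal with low-rank features would come out as "structural".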

Towards Rigorous Explainability by Feature Attribution

2604.15898v1 by Olivier Létoffé, Xuanxiang Huang, Joao Marques-Silva

For around a decade, non-symbolic methods have been the option of choice when explaining complex machine learning (ML) models. Unfortunately, such methods lack rigor and can mislead human decision-makers. In high-stakes uses of ML, the lack of rigor is especially problematic. One prime example of provable lack of rigor is the adoption of Shapley values in explainable artificial intelligence (XAI), with the tool SHAP being a ubiquitous example. This paper overviews the ongoing efforts towards using rigorous symbolic methods of XAI as an alternative to non-rigorous non-symbolic approaches, concretely for assigning relative feature importance.
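
For reference, the Shapley value that the abstract refers to can be computed exactly for a small number of features. The sketch below only illustrates the standard definition; it does not reproduce the paper's analysis or the SHAP tool, and the `value` function is a hypothetical stand-in for whatever characteristic function an explainer would use.

```python
from itertools import combinations
from math import factorial

def shapley_values(value, n_features):
    """Exact Shapley values for a set function value(frozenset) -> float.

    Cost grows exponentially with n_features, so this is for toy cases only.
    """
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

# Hypothetical game: the payoff is 1 only when features 0 and 1 are both present.
toy_value = lambda s: 1.0 if {0, 1} <= s else 0.0
print(shapley_values(toy_value, 3))
```

In this toy game features 0 and 1 split the payoff equally and feature 2 receives zero, which matches the symmetry and null-player properties of the definition.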

Closing the Theory-Practice Gap in Spiking Transformers via Effective Dimension

2604.15769v1 by Dongxin Guo, Jikun Wu, Siu Ming Yiu

Spiking transformers achieve competitive accuracy with conventional transformers while offering $38$--$57\times$ energy efficiency on neuromorphic hardware, yet no theoretical framework guides their design. This paper establishes the first comprehensive expressivity theory for spiking self-attention. We prove that spiking attention with Leaky Integrate-and-Fire neurons is a universal approximator of continuous permutation-equivariant functions, providing explicit spike circuit constructions including a novel lateral inhibition network for softmax normalization with proven $O(1/\sqrt{T})$ convergence. We derive tight spike-count lower bounds via rate-distortion theory: $\varepsilon$-approximation requires $\Omega(L_f^2 nd/\varepsilon^2)$ spikes, with rigorous information-theoretic derivation. Our key insight is input-dependent bounds using measured effective dimensions ($d_{\text{eff}}=47$--$89$ for CIFAR/ImageNet), explaining why $T=4$ timesteps suffice despite worst-case $T \geq 10{,}000$ predictions. We provide concrete design rules with calibrated constants ($C=2.3$, 95% CI: $[1.9, 2.7]$). Experiments on Spikformer, QKFormer, and SpikingResformer across vision and language benchmarks validate predictions with $R^2=0.97$ ($p<0.001$). Our framework provides the first principled foundation for neuromorphic transformer design.

LLM Reasoning Is Latent, Not the Chain of Thought

2604.15726v1 by Wenshuo Wang

This position paper argues that large language model (LLM) reasoning should be studied as latent-state trajectory formation rather than as faithful surface chain-of-thought (CoT). This matters because claims about faithfulness, interpretability, reasoning benchmarks, and inference-time intervention all depend on what the field takes the primary object of reasoning to be. We ask what that object should be once three often-confounded factors are separated and formalize three competing hypotheses: H1, reasoning is primarily mediated by latent-state trajectories; H2, reasoning is primarily mediated by explicit surface CoT; and H0, most apparent reasoning gains are better explained by generic serial compute than by any privileged representational object. Reorganizing recent empirical, mechanistic, and survey work under this framework, and adding compute-audited worked exemplars that factorize surface traces, latent interventions, and matched budget expansions, we find that current evidence most strongly supports H1 as a default working hypothesis rather than as a task-independent verdict. We therefore make two recommendations: the field should treat latent-state dynamics as the default object of study for LLM reasoning, and it should evaluate reasoning with designs that explicitly disentangle surface traces, latent states, and serial compute.

LLM attribution analysis across different fine-tuning strategies and model scales for automated code compliance

2604.15589v1 by Jack Wei Lun Shi, Minghao Dang, Wawan Solihin, Justin K. W. Yeoh

Existing research on large language models (LLMs) for automated code compliance has primarily focused on performance, treating the models as black boxes and overlooking how training decisions affect their interpretive behavior. This paper addresses this gap by employing a perturbation-based attribution analysis to compare the interpretive behaviors of LLMs across different fine-tuning strategies such as full fine-tuning (FFT), low-rank adaptation (LoRA) and quantized LoRA fine-tuning, as well as the impact of model scales which include varying LLM parameter sizes. Our results show that FFT produces attribution patterns that are statistically different and more focused than those from parameter-efficient fine-tuning methods. Furthermore, we found that as model scale increases, LLMs develop specific interpretive strategies such as prioritizing numerical constraints and rule identifiers in the building text, albeit with performance gains in semantic similarity of the generated and reference computer-processable rules plateauing for models larger than 7B. This paper provides crucial insights into the explainability of these models, taking a step toward building more transparent LLMs for critical, regulation-based tasks in the Architecture, Engineering, and Construction industry.
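
One common form of perturbation-based attribution is leave-one-out occlusion, sketched below. The abstract does not specify which perturbation scheme the authors use, so this is an assumed variant; `score_fn` is a hypothetical stand-in for a model-derived score (for example, similarity between generated and reference rules), and the toy scorer is invented for illustration.

```python
def occlusion_attribution(tokens, score_fn, mask_token="[MASK]"):
    """Attribute a score to each token by masking it and measuring the drop."""
    base = score_fn(tokens)
    attributions = []
    for i in range(len(tokens)):
        # Replace exactly one token and re-score the perturbed input.
        perturbed = tokens[:i] + [mask_token] + tokens[i + 1:]
        attributions.append(base - score_fn(perturbed))
    return attributions

# Hypothetical scorer that rewards a numerical constraint and a comparison word.
def toy_score(tokens):
    return 1.0 * ("2.4" in tokens) + 0.5 * ("exceed" in tokens)

clause = ["height", "shall", "exceed", "2.4", "m"]
print(occlusion_attribution(clause, toy_score))
```

Tokens whose removal lowers the score most receive the largest attribution, which is the kind of signal that could reveal a focus on numerical constraints.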

Towards Reliable Testing of Machine Unlearning

2604.16536v1 by Anna Mazhar, Sainyam Galhotra

Machine learning components are now central to AI-infused software systems, from recommendations and code assistants to clinical decision support. As regulations and governance frameworks increasingly require deleting sensitive data from deployed models, machine unlearning is emerging as a practical alternative to full retraining. However, unlearning introduces a software quality-assurance challenge: under realistic deployment constraints and imperfect oracles, how can we test that a model no longer relies on targeted information? This paper frames unlearning testing as a first-class software engineering problem. We argue that practical unlearning tests must provide (i) thorough coverage over proxy and mediated influence pathways, (ii) debuggable diagnostics that localize where leakage persists, (iii) cost-effective regression-style execution under query budgets, and (iv) black-box applicability for API-deployed models. We outline a causal, pathway-centric perspective, causal fuzzing, that generates budgeted interventions to estimate residual direct and indirect effects and produce actionable "leakage reports". Proof-of-concept results illustrate that standard attribution checks can miss residual influence due to proxy pathways, cancellation effects, and subgroup masking, motivating causal testing as a promising direction for unlearning testing.

Beyond Attack Success Rate: A Multi-Metric Evaluation of Adversarial Transferability in Medical Imaging Models

2604.16532v1 by Emily Curl, Kofi Ampomah, Md Erfan, Sayanton Dibbo

While deep learning systems are becoming increasingly prevalent in medical image analysis, their vulnerabilities to adversarial perturbations raise serious concerns for clinical deployment. These vulnerability evaluations largely rely on Attack Success Rate (ASR), a binary metric that indicates solely whether an attack is successful. However, the ASR metric does not account for other factors, such as perturbation strength, perceptual image quality, and cross-architecture attack transferability, and therefore, the interpretation is incomplete. This gap requires consideration, as complex, large-scale deep learning systems, including Vision Transformers (ViTs), are increasingly challenging the dominance of Convolutional Neural Networks (CNNs). These architectures learn differently, and it is unclear whether a single metric, e.g., ASR, can effectively capture adversarial behavior. To address this, we perform a systematic empirical study on four medical image datasets: PathMNIST, DermaMNIST, RetinaMNIST, and CheXpert. We evaluate seven models (VGG-16, ResNet-50, DenseNet-121, Inception-v3, DeiT, Swin Transformer, and ViT-B/16) against seven attack methods at five perturbation budgets, measuring ASR, Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and $L_2$ perturbation magnitude. Our findings show a consistent pattern: perceptual and distortion metrics are strongly associated with one another and exhibit minimal correlation with ASR. This applies to both CNNs and ViTs. The results demonstrate that ASR alone is an inadequate indicator of adversarial robustness and transferability. Consequently, we argue that a thorough assessment of adversarial risk in medical AI necessitates multi-metric frameworks that encompass not only the attack efficacy but also its methodology and associated overheads.
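
Three of the four metrics named above have short standard definitions, sketched here with NumPy. The helper names are assumptions, images are assumed to be scaled to [0, 1], and ASR is assumed to be computed over examples the model classified correctly before the attack (definitions vary across papers). SSIM is omitted because it needs a windowed implementation.

```python
import numpy as np

def l2_norm(clean, adv):
    """L2 magnitude of the perturbation adv - clean."""
    diff = np.asarray(adv, dtype=float) - np.asarray(clean, dtype=float)
    return float(np.linalg.norm(diff.ravel()))

def psnr(clean, adv, max_val=1.0):
    """Peak signal-to-noise ratio in dB; higher means a less visible change."""
    diff = np.asarray(adv, dtype=float) - np.asarray(clean, dtype=float)
    mse = float(np.mean(diff ** 2))
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def attack_success_rate(labels, clean_preds, adv_preds):
    """Fraction of originally correct predictions flipped by the attack."""
    labels, clean_preds, adv_preds = map(np.asarray, (labels, clean_preds, adv_preds))
    correct = clean_preds == labels
    if not correct.any():
        return 0.0
    return float(np.mean(adv_preds[correct] != labels[correct]))
```

Because ASR only records whether a prediction flipped, two attacks with the same ASR can differ widely in PSNR and L2 magnitude, which is the gap the abstract points to.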

DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI

2604.15456v1 by Zhizheng Wang, Chih-Hsuan Wei, Joey Chan, Robert Leaman, Chi-Ping Day, Chuan Wu, Mark A Knepper, Antolin Serrano Farias, Jordina Rincon-Torroella, Hasan Slika, Betty Tyler, Ryan Huu-Tuan Nguyen, Asmita Indurkar, Mélanie Hébert, Shubo Tian, Lauren He, Noor Naffakh, Aseem Aseem, Nicholas Wan, Emily Y Chew, Tiarnan D L Keenan, Zhiyong Lu

Trustworthiness and transparency are essential for the clinical adoption of artificial intelligence (AI) in healthcare and biomedical research. Recent deep research systems aim to accelerate evidence-grounded scientific discovery by integrating AI agents with multi-hop information retrieval, reasoning, and synthesis. However, most existing systems lack explicit and inspectable criteria for evidence appraisal, creating a risk of compounding errors and making it difficult for researchers and clinicians to assess the reliability of their outputs. In parallel, current benchmarking approaches rarely evaluate performance on complex, real-world medical questions. Here, we introduce DeepER-Med, a Deep Evidence-based Research framework for Medicine with an agentic AI system. DeepER-Med frames deep medical research as an explicit and inspectable workflow of evidence-based generation, consisting of three modules: research planning, agentic collaboration, and evidence synthesis. To support realistic evaluation, we also present DeepER-MedQA, an evidence-grounded dataset comprising 100 expert-level research questions derived from authentic medical research scenarios and curated by a multidisciplinary panel of 11 biomedical experts. Expert manual evaluation demonstrates that DeepER-Med consistently outperforms widely used production-grade platforms across multiple criteria, including the generation of novel scientific insights. We further demonstrate the practical utility of DeepER-Med through eight real-world clinical cases. Human clinician assessment indicates that DeepER-Med's conclusions align with clinical recommendations in seven cases, highlighting its potential for medical research and decision support.

RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

2604.15231v1 by Mélanie Roschewitz, Kenneth Styppa, Yitian Tao, Jiwoong Sohn, Jean-Benoit Delbrouck, Benjamin Gundersen, Nicolas Deperrois, Christian Bluethgen, Julia Vogt, Bjoern Menze, Farhad Nooralahzadeh, Michael Krauthammer, Michael Moor

Vision-language models (VLM) have markedly advanced AI-driven interpretation and reporting of complex medical imaging, such as computed tomography (CT). Yet, existing methods largely relegate clinicians to passive observers of final outputs, offering no interpretable reasoning trace for them to inspect, validate, or refine. To address this, we introduce RadAgent, a tool-using AI agent that generates CT reports through a stepwise and interpretable process. Each resulting report is accompanied by a fully inspectable trace of intermediate decisions and tool interactions, allowing clinicians to examine how the reported findings are derived. In our experiments, we observe that RadAgent improves Chest CT report generation over its 3D VLM counterpart, CT-Chat, across three dimensions. Clinical accuracy improves by 6.0 points (36.4% relative) in macro-F1 and 5.4 points (19.6% relative) in micro-F1. Robustness under adversarial conditions improves by 24.7 points (41.9% relative). Furthermore, RadAgent achieves 37.0% in faithfulness, a new capability entirely absent in its 3D VLM counterpart. By structuring the interpretation of chest CT as an explicit, tool-augmented and iterative reasoning trace, RadAgent brings us closer toward transparent and reliable AI for radiology.
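
The absolute and relative gains quoted above can be related with simple arithmetic. The sketch assumes that "relative" means absolute gain divided by the baseline score; the abstract does not state the baseline values, so the numbers this produces are implied estimates subject to rounding, not reported results.

```python
def implied_baseline(abs_gain_points, rel_gain_percent):
    """Baseline score implied by an absolute gain and its relative size."""
    return abs_gain_points / (rel_gain_percent / 100.0)

# (absolute gain in points, relative gain in percent), as quoted in the abstract.
reported = {
    "macro-F1": (6.0, 36.4),
    "micro-F1": (5.4, 19.6),
    "robustness": (24.7, 41.9),
}
for name, (gain, rel) in reported.items():
    base = implied_baseline(gain, rel)
    print(f"{name}: implied baseline {base:.1f}, implied new score {base + gain:.1f}")
```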

Expert-Annotated Embryo Image Dataset with Natural Language Descriptions for Evidence-Based Patient Communication in IVF

2604.16528v1 by Nicklas Neu, Thomas Ebner, Jasmin Primus, Bernhard Schenkenfelder, Raphael Zefferer, Mathias Brunbauer, Florian Kromp

Embryo selection is one of multiple crucial steps in in-vitro fertilization, commonly based on morphological assessment by clinical embryologists. Although artificial intelligence methods have demonstrated their potential to support embryo selection by automated embryo ranking or grading methods, the overall impact of AI-based solutions is still limited. This is mainly due to the required adaptation of automated solutions to custom clinical data, reliance on time lapse incubators and a lack of interpretability to understand AI reasoning. The modern, informed patient is questioning expert decisions, particularly if the treatment is not successful. Thus, evidence-based decision justification in tasks like embryo selection would support transparent decision making and respectful patient communication. To support this aim, we hereby present an expert-annotated dataset consisting of embryo images and corresponding morphological description using natural language. The description contains relevant information on embryonic cell cycle, developmental stage and morphological features. This dataset enables the finetuning of modern foundational vision-language models to learn and improve over time with high accuracy. Predicted embryo descriptions can then be leveraged to automatically extract scientific evidence from literature, facilitating well-informed, evidence-based decision-making and transparent communication with patients. Our proposed dataset supports research in language-based, interpretable, and transparent automated embryo assessment and has the potential to enhance the decision-making process and improve patient outcomes significantly over time.

Agentic Explainability at Scale: Between Corporate Fears and XAI Needs

2604.14984v1 by Yomna Elsayed, Cecily Jones

As companies enter the race for agentic AI adoption, fears surface around agentic autonomy and its subsequent risks. These fears compound as companies scale their agentic AI adoption with low-code applications without a comparable scaling in their governance processes and expertise, resulting in a phenomenon known as "Agent Sprawl". While shadow AI tools can help with agentic discovery and identification, few observability tools offer insights into the agents' configuration and settings or the decision-making process during agent-to-agent communication and orchestration. This paper explores AI governance professionals' concerns in enterprise settings, while offering design-time and runtime explainability techniques as suggested by AI governance experts for addressing those fears. Finally, we provide a preliminary prototype of an Agentic AI Card that can help companies feel at ease deploying agents at scale.

Hybrid Decision Making via Conformal VLM-generated Guidance

2604.14980v2 by Debodeep Banerjee, Burcu Sayin, Stefano Teso, Andrea Passerini

Building on recent advances in AI, hybrid decision making (HDM) holds the promise of improving human decision quality and reducing cognitive load. We work in the context of learning to guide (LtG), a recently proposed HDM framework in which the human is always responsible for the final decision: rather than suggesting decisions, in LtG the AI supplies (textual) guidance useful for facilitating decision making. One limiting factor of existing approaches is that their guidance compounds information about all possible outcomes, and as a result it can be difficult to digest. We address this issue by introducing ConfGuide, a novel LtG approach that generates more succinct and targeted guidance. To this end, it employs conformal risk control to select a set of outcomes, ensuring a cap on the false negative rate. We demonstrate our approach on a real-world multi-label medical diagnosis task. Our empirical evaluation highlights the promise of ConfGuide.
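
A generic form of conformal risk control for multi-label prediction sets, with the false negative rate as the controlled risk, can be sketched as below. This follows the standard calibration rule for a loss bounded by 1 and is not ConfGuide itself; the set construction `probs >= 1 - lam`, the grid, and the function names are assumptions for the example.

```python
import numpy as np

def false_negative_rate(probs, labels, lam):
    """Mean fraction of true labels missed by the set {k : probs[k] >= 1 - lam}."""
    pred = probs >= 1.0 - lam
    truth = labels.astype(bool)
    true_counts = truth.sum(axis=1)
    hit = (pred & truth).sum(axis=1)
    # Examples with no true label contribute zero loss.
    fnr = np.where(true_counts > 0, 1.0 - hit / np.maximum(true_counts, 1), 0.0)
    return float(fnr.mean())

def calibrate_lambda(probs, labels, alpha, grid=None):
    """Smallest lam on the grid whose adjusted empirical risk is at most alpha."""
    n = len(probs)
    grid = np.linspace(0.0, 1.0, 101) if grid is None else grid
    for lam in grid:
        risk = false_negative_rate(probs, labels, lam)
        # Finite-sample correction for a loss bounded by 1.
        if (n / (n + 1)) * risk + 1.0 / (n + 1) <= alpha:
            return float(lam)
    return 1.0  # fall back to including every label
```

Larger values of `lam` admit more outcomes into the set, so the loss never increases with `lam`; picking the smallest admissible value keeps the selected set, and hence any guidance built on it, as short as the risk cap allows.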

Can LLMs Score Medical Diagnoses and Clinical Reasoning as well as Expert Panels?

2604.14892v2 by Amy Rouillard, Sitwala Mundia, Linda Camara, Michael Cameron Gramanie, Ziyaad Dangor, Ismail Kalla, Shabir A. Madhi, Kajal Morar, Marlvin T. Ncube, Haroon Saloojee, Bruce A. Bassett

Evaluating medical AI systems using expert clinician panels is costly and slow, motivating the use of large language models (LLMs) as alternative adjudicators. Here, we evaluate an LLM jury composed of three frontier AI models scoring 3333 diagnoses on 300 real-world middle-income country (MIC) hospital cases. Model performance was benchmarked against expert clinician panel and independent human re-scoring panel evaluations. Both LLM and clinician-generated diagnoses are scored across four dimensions: diagnosis, differential diagnosis, clinical reasoning and negative treatment risk. For each of these, we assess scoring difference, inter-rater agreement, scoring stability, severe safety errors and the effect of post-hoc calibration. We find that: (i) the uncalibrated LLM jury scores are systematically lower than clinician panel scores; (ii) the LLM jury preserves ordinal agreement and exhibits better concordance with the primary expert panels than the human expert re-score panels do; (iii) the probability of severe errors is lower in LLM jury models compared to the human expert re-score panels; (iv) the LLM jury shows excellent agreement with primary expert panels' rankings, and combined with AI model diagnoses it can be used to identify ward diagnoses at high risk of error, enabling targeted expert review and improved panel efficiency; (v) LLM jury models show no self-preference bias: they did not score diagnoses generated by their own underlying model or models from the same vendor more (or less) favourably than those generated by other models. Finally, we demonstrate that LLM jury calibration using isotonic regression improves alignment with human expert panel evaluations. Together, these results provide compelling evidence that a calibrated, multi-model LLM jury can serve as a trustworthy and reliable proxy for expert clinician evaluation in medical AI benchmarking.
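
Isotonic regression, the calibration method named above, fits a non-decreasing mapping from jury scores to panel scores. A minimal pure-Python sketch using the pool-adjacent-violators algorithm follows; the score values are hypothetical, the step-function lookup in `calibrate` is one simple choice among several, and nothing here reproduces the authors' calibration setup.

```python
from bisect import bisect_right

def isotonic_fit(x, y):
    """Pool-adjacent-violators: non-decreasing least-squares fit of y against x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    # Each block stores [sum of y, count]; a merged block shares one fitted value.
    blocks = []
    for i in order:
        blocks.append([y[i], 1])
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    xs = [x[i] for i in order]
    return xs, fitted

def calibrate(score, xs, fitted):
    """Map a new jury score to the fitted value at the nearest lower knot."""
    idx = bisect_right(xs, score) - 1
    return fitted[max(idx, 0)]

# Hypothetical paired scores: LLM jury (x) and expert panel (y).
jury = [1, 2, 3, 4, 5]
panel = [1, 3, 2, 4, 5]
xs, fitted = isotonic_fit(jury, panel)
print(fitted)
```

The fit pools the out-of-order pair (3, 2) into a shared value, so the calibrated mapping preserves the jury's ranking while shifting its scale toward the panel's.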

M2-PALE: A Framework for Explaining Multi-Agent MCTS-Minimax Hybrids via Process Mining and LLMs

2604.14687v1 by Yiyu Qian, Liyuan Zhao, Tim Miller

Monte-Carlo Tree Search (MCTS) is a fundamental sampling-based search algorithm widely used for online planning in sequential decision-making domains. Despite its success in driving recent advances in artificial intelligence, understanding the behavior of MCTS agents remains a challenge for both developers and users. This difficulty stems from the complex search trees produced through the simulation of numerous future states and their intricate relationships. A known weakness of standard MCTS is its reliance on highly selective tree construction, which may lead to the omission of crucial moves and a vulnerability to tactical traps. To resolve this, we incorporate shallow, full-width Minimax search into the rollout phase of multi-agent MCTS to enhance strategic depth. Furthermore, to demystify the resulting decision-making logic, we introduce M2-PALE (MCTS-Minimax Process-Aided Linguistic Explanations). This framework employs process mining techniques, specifically the Alpha Miner, iDHM, and Inductive Miner algorithms, to extract underlying behavioral workflows from agent execution traces. These process models are then synthesized by LLMs to generate human-readable causal and distal explanations. We demonstrate the efficacy of our approach in a small-scale checkers environment, establishing a scalable foundation for interpreting hybrid agents in increasingly complex strategic domains.
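
Process discovery algorithms such as the Alpha Miner start from the directly-follows relation between events in a log. The sketch below counts that relation over a few execution traces; the event names are hypothetical placeholders for whatever an MCTS-Minimax agent would log, and the paper's actual pipeline is not reproduced.

```python
from collections import Counter

def directly_follows(traces):
    """Count how often event a is immediately followed by event b."""
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts

# Hypothetical agent execution traces, one list of events per search iteration.
traces = [
    ["select", "expand", "minimax_rollout", "backpropagate"],
    ["select", "expand", "minimax_rollout", "backpropagate"],
    ["select", "minimax_rollout", "backpropagate"],
]
for (a, b), n in sorted(directly_follows(traces).items()):
    print(f"{a} -> {b}: {n}")
```

A discovered process model summarizes these pairwise orderings into a workflow, which is the kind of structure an LLM could then be asked to describe in natural language.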

Analyzing Chain of Thought (CoT) Approaches in Control Flow Code Deobfuscation Tasks

2604.15390v2 by Seyedreza Mohseni, Sarvesh Baskar, Edward Raff, Manas Gaur

Code deobfuscation is the task of recovering a readable version of a program while preserving its original behavior. In practice, this often requires days or even months of manual work with complex and expensive analysis tools. In this paper, we explore an alternative approach based on Chain-of-Thought (CoT) prompting, where a large language model is guided through explicit, step-by-step reasoning tailored for code analysis. We focus on control flow obfuscation, including Control Flow Flattening (CFF), Opaque Predicates, and their combination, and we measure both structural recovery of the control flow graph and preservation of program semantics. We evaluate five state-of-the-art large language models and show that CoT prompting significantly improves deobfuscation quality compared with simple prompting. We validate our approach on a diverse set of standard C benchmarks and report results using both structural metrics for control flow graphs and semantic metrics based on output similarity. Among the tested models and by applying CoT, GPT5 achieves the strongest overall performance, with an average gain of about 16% in control-flow graph reconstruction and about 20.5% in semantic preservation across our benchmarks compared to zero-shot prompting. Our results also show that model performance depends not only on the obfuscation level and the chosen obfuscator but also on the intrinsic complexity of the original control flow graph. Collectively, these findings suggest that CoT-guided large language models can serve as effective assistants for code deobfuscation, providing improved code explainability, more faithful control flow graph reconstruction, and better preservation of program behavior while potentially reducing the manual effort needed for reverse engineering.

Rethinking Patient Education as Multi-turn Multi-modal Interaction

2604.14656v1 by Zonghai Yao, Zhipeng Tang, Chengtao Lin, Xiong Luo, Benlu Wang, Juncheng Huang, Chin Siang Ong, Hong Yu

Most medical multimodal benchmarks focus on static tasks such as image question answering, report generation, and plain-language rewriting. Patient education is more demanding: systems must identify relevant evidence across images, show patients where to look, explain findings in accessible language, and handle confusion or distress. Yet most patient education work remains text-only, even though combined image-and-text explanations may better support understanding. We introduce MedImageEdu, a benchmark for multi-turn, evidence-grounded radiology patient education. Each case provides a radiology report with report text and case images. A DoctorAgent interacts with a PatientAgent, conditioned on a hidden profile that captures factors such as education level, health literacy, and personality. When a patient question would benefit from visual support, the DoctorAgent can issue drawing instructions grounded in the report, case images, and the current question to a benchmark-provided drawing tool. The tool returns image(s), after which the DoctorAgent produces a final multimodal response consisting of the image(s) and a grounded plain-language explanation. MedImageEdu contains 150 cases from three sources and evaluates both the consultation process and the final multimodal response along five dimensions: Consultation, Safety and Scope, Language Quality, Drawing Quality, and Image-Text Response Quality. Across representative open- and closed-source vision-language model agents, we find three consistent gaps: fluent language often outpaces faithful visual grounding, safety is the weakest dimension across disease categories, and emotionally tense interactions are harder than low education or low health literacy. MedImageEdu provides a controlled testbed for assessing whether multimodal agents can teach from evidence rather than merely answer from text.

CoDaS: AI Co-Data-Scientist for Biomarker Discovery via Wearable Sensors

2604.14615v1 by Yubin Kim, Salman Rahman, Samuel Schmidgall, Chunjong Park, A. Ali Heydari, Ahmed A. Metwally, Hong Yu, Xin Liu, Xuhai Xu, Yuzhe Yang, Maxwell A. Xu, Zhihan Zhang, Cynthia Breazeal, Tim Althoff, Petar Sirkovic, Ivor Rendulic, Annalisa Pawlosky, Nicolas Stroppa, Juraj Gottweis, Elahe Vedadi, Alan Karthikesalingam, Pushmeet Kohli, Vivek Natarajan, Mark Malhotra, Shwetak Patel, Hae Won Park, Hamid Palangi, Daniel McDuff

Scientific discovery in digital health requires converting continuous physiological signals from wearable devices into clinically actionable biomarkers. We introduce CoDaS (AI Co-Data-Scientist), a multi-agent system that structures biomarker discovery as an iterative process combining hypothesis generation, statistical analysis, adversarial validation, and literature-grounded reasoning with human oversight using large-scale wearable datasets. Across three cohorts totaling 9,279 participant-observations, CoDaS identified 41 candidate digital biomarkers for mental health and 25 for metabolic outcomes, each subjected to an internal validation battery spanning replication, stability, robustness, and discriminative power. Across two independent depression cohorts, CoDaS surfaced circadian instability-related features in both datasets, reflected in sleep duration variability (DWB, ρ= 0.252, p < 0.001) and sleep onset variability (GLOBEM, ρ= 0.126, p < 0.001). In a metabolic cohort, CoDaS derived a cardiovascular fitness index (steps/resting heart rate; ρ= -0.374, p < 0.001), and recovered established clinical associations, including the hepatic function ratio (AST/ALT; ρ= -0.375, p < 0.001), a known correlate of insulin resistance. Incorporating CoDaS-derived features alongside demographic variables led to modest but consistent improvements in predictive performance, with cross-validated ΔR^2 increases of 0.040 for depression and 0.021 for insulin resistance. These findings suggest that CoDaS enables systematic and traceable hypothesis generation and prioritization for biomarker discovery from large-scale wearable data.
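
The association statistics quoted above are rank correlations between a derived feature and an outcome. A minimal sketch of that kind of computation follows, assuming a ratio feature (steps divided by resting heart rate) and Spearman's ρ computed as the Pearson correlation of ranks. The data values and every function name here are hypothetical illustrations and are not taken from CoDaS.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy cohort (not real data).
daily_steps = [4000, 6500, 8000, 10000, 12000, 3000]
resting_hr = [78, 70, 66, 60, 55, 82]
outcome = [3.1, 2.4, 2.0, 1.5, 1.1, 3.6]  # e.g. an insulin-resistance score

# Assumed form of the ratio feature: steps per unit of resting heart rate.
fitness_index = [s / hr for s, hr in zip(daily_steps, resting_hr)]
rho = spearman(fitness_index, outcome)
print(f"Spearman rho = {rho:.3f}")
```

In this toy cohort the index and the outcome are perfectly inversely ordered, so ρ comes out at -1; real wearable data would give far weaker values, as the reported coefficients show.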

Generative Augmented Inference

2604.14575v1 by Cheng Lu, Mengxin Wang, Dennis J. Zhang, Heng Zhang

Data-driven operations management often relies on parameters estimated from costly human-generated labels. Recent advances in large language models (LLMs) and other AI systems offer inexpensive auxiliary data, but introduce a new challenge: AI outputs are not direct observations of the target outcomes and may instead be high-dimensional representations with complex and unknown relationships to human labels. Conventional methods use AI predictions as direct proxies for true labels, which can be inefficient or unreliable when this relationship is weak or misspecified. We propose Generative Augmented Inference (GAI), a general framework that incorporates AI-generated outputs as informative features for estimating models of human-labeled outcomes. GAI uses an orthogonal moment construction that enables consistent estimation and valid inference under a flexible, nonparametric relationship between LLM-generated outputs and human labels. We establish asymptotic normality and show a "safe default" property: relative to human-data-only estimators, GAI weakly improves estimation efficiency under arbitrary auxiliary signals and yields strict gains whenever the auxiliary information is predictive. Empirically, GAI outperforms benchmarks across diverse settings. In conjoint analysis with weak auxiliary signals, GAI reduces estimation error by about 50% and lowers human labeling requirements by over 75%. In retail pricing, where all methods access the same auxiliary inputs, GAI consistently outperforms alternative estimators, highlighting the value of its construction rather than differences in information. In health insurance choice, it cuts labeling requirements by over 90% while maintaining decision accuracy. Across applications, GAI improves confidence interval coverage without inflating width. Overall, GAI provides a principled and scalable approach to integrating AI-generated information.

Perspective on Bias in Biomedical AI: Preventing Downstream Healthcare Disparities

2604.14514v1 by Michal Rosen-Zvi, Yoav Kan-Tor, Michael Danziger, Agata Ferretti, Javier Aula-Blasco, Julia Falcao, Ron Shamir, Mordechai Muszkat

Healthcare disparities persist across socioeconomic boundaries and are often attributed to unequal access to screening, diagnostics, and therapeutics. However, this perspective highlights that critical biases can emerge much earlier, during data collection and research prioritization, long before clinical implementation, particularly when studies focus on and collect data at the molecular level. A vast number of studies collect omics data, but the demographic information associated with these datasets is often not reported, and when it is reported, it shows substantial biases. An automated analysis of 4719 PubMed-indexed omics publications from 2015 to 2024 reveals that only a small fraction report ancestry or ethnicity information, with ancestry reporting improving slightly. Analysis of large-scale datasets commonly used for model training, such as CellxGene and GEO, reveals substantial population bias in which European-ancestry data dominates. As biomedical foundation models become central to biomedical discovery, under a paradigm in which base models are pretrained on large datasets and reused repeatedly for many different downstream tasks, they risk perpetuating or amplifying these early-stage biases, leading to cascading inequities that regulatory interventions cannot fully reverse. We propose a community-wide focus on three foundational principles, Provenance, Openness, and Evaluation Transparency, to improve equity and robustness in biomedical AI. This approach aims to foster biomedical innovation that more effectively serves underserved populations and improves health outcomes.

When PCOS Meets Eating Disorders: An Explainable AI Approach to Detecting the Hidden Triple Burden

2604.14356v1 by Apoorv Prasad, Susan McRoy

Women with polycystic ovary syndrome (PCOS) face substantially elevated risks of body image distress, disordered eating, and metabolic challenges, yet existing natural language processing approaches for detecting these conditions lack transparency and cannot identify co-occurring presentations. We developed small, open-source language models to automatically detect this triple burden in social media posts with grounded explainability. We collected 1,000 PCOS-related posts from six subreddits, and two trained annotators labeled the posts using guidelines that operationalize the clinical framework of Lee et al. (2017). Three models (Gemma-2-2B, Qwen3-1.7B, DeepSeek-R1-Distill-Qwen-1.5B) were fine-tuned using Low-Rank Adaptation to generate structured explanations with textual evidence. The best model achieved 75.3 percent exact match accuracy on 150 held-out posts, with robust comorbidity detection and strong explainability. Performance declined with diagnostic complexity, indicating that the models are best used for screening rather than autonomous diagnosis.

Faithfulness Serum: Mitigating the Faithfulness Gap in Textual Explanations of LLM Decisions via Attribution Guidance

2604.14325v1 by Bar Alon, Itamar Zimerman, Lior Wolf

Large language models (LLMs) achieve strong performance and have revolutionized NLP, but their lack of explainability keeps them treated as black boxes, limiting their use in domains that demand transparency and trust. A promising direction to address this issue is post-hoc text-based explanations, which aim to explain model decisions in natural language. Prior work has focused on generating convincing rationales that appear to be subjectively faithful, but it remains unclear whether these explanations are epistemically faithful, that is, whether they reflect the internal evidence the model actually relied on for its decision. In this paper, we first assess the epistemic faithfulness of LLM-generated explanations via counterfactuals and show that they are often unfaithful. We then introduce a training-free method that enhances faithfulness by guiding explanation generation through attention-level interventions, informed by token-level heatmaps extracted via a faithful attribution method. This method significantly improves epistemic faithfulness across multiple models, benchmarks, and prompts.

Seeing Through Experts' Eyes: A Foundational Vision Language Model Trained on Radiologists' Gaze and Reasoning

2604.14316v1 by Kinhei Lee, Peiyuan Jing, Zhenxuan Zhang, Yue Yang, Tao Wang, Dominic C Marshall, Yingying Fang, Guang Yang

Large-scale vision-language models have shown promise in automating chest X-ray interpretation, yet their clinical utility remains limited by a gap between model outputs and radiologist reasoning. Most systems optimize for semantic information without emulating how experts visually examine medical images, often overlooking critical findings or diverging from established diagnostic workflows. Radiologists follow structured protocols (e.g., the ABCDEF approach) that ensure all clinically relevant regions are systematically examined, reducing missed findings and supporting reliable diagnostic reasoning. We introduce GazeX, a vision-language model that leverages radiologists' eye-tracking data as a behavioral prior to model expert diagnostic reasoning. By incorporating gaze trajectories and fixation patterns into pretraining, GazeX learns to follow the spatial and temporal structure of radiologist attention and integrates observations in a clinically meaningful sequence. Using a curated dataset of over 30,000 gaze key frames from five radiologists, we demonstrate that GazeX produces more accurate, interpretable, and expert-consistent outputs across radiology report generation, disease grounding, and visual question answering, utilizing 231,835 radiographic studies, 780,014 question-answer pairs, and 1,162 image-sentence pairs with bounding boxes. Unlike autonomous reporting systems, GazeX produces verifiable evidence artifacts, including inspection trajectories and finding-linked localized regions, enabling efficient human verification and safe human-AI collaboration. Learning through expert eyes provides a practical route toward more trustworthy, explainable, and diagnostically robust AI systems for radiology and beyond.

EuropeMedQA Study Protocol: A Multilingual, Multimodal Medical Examination Dataset for Language Model Evaluation

2604.14306v2 by Francesco Andrea Causio, Vittorio De Vita, Olivia Riccomi, Michele Ferramola, Federico Felizzi, Alessandro Tosi, Antonio Cristiano, Lorenzo De Mori, Chiara Battipaglia, Melissa Sawaya, Luigi De Angelis, Marcello Di Pumpo, Alessandra Piscitelli, Pietro Eric Risuleo, Alessia Longo, Giulia Vojvodic, Mariapia Vassalli, Bianca Destro Castaniti, Nicolò Scarsi, Manuel Del Medico

While Large Language Models (LLMs) have demonstrated high proficiency on English-centric medical examinations, their performance often declines when faced with non-English languages and multimodal diagnostic tasks. This study protocol describes the development of EuropeMedQA, the first comprehensive, multilingual, and multimodal medical examination dataset sourced from official regulatory exams in Italy, France, Spain, and Portugal. Following FAIR data principles and SPIRIT-AI guidelines, we describe a rigorous curation process and an automated translation pipeline for comparative analysis. We evaluate contemporary multimodal LLMs using a zero-shot, strictly constrained prompting strategy to assess cross-lingual transfer and visual reasoning. EuropeMedQA aims to provide a contamination-resistant benchmark that reflects the complexity of European clinical practices and fosters the development of more generalizable medical AI.

Quantum-inspired tensor networks in machine learning models

2604.14287v1 by Guillermo Valverde, Igor García-Olaizola, Giannicola Scarpa, Alejandro Pozas-Kerstjens

Tensor networks were developed in the context of many-body physics as compressed representations of multiparticle quantum states. These representations mitigate the exponential complexity of many-body systems by capturing only the most relevant dependencies. Due to the formal similarity between quantum entanglement and statistical correlations, tensor networks have recently been integrated in machine learning, operating both as alternative learning architectures and as decompositions of components of neural networks. The expectation is that the theoretical understanding of tensor networks developed within quantum many-body physics leads to novel methods that offer advantages in terms of computational efficiency, explainability, or privacy. Here we review the use of tensor networks in the context of machine learning, providing a critical assessment of the state of the art, the potential advantages, and the challenges that must be overcome.

Applied Explainability for Large Language Models: A Comparative Study

2604.15371v1 by Venkata Abhinandan Kancharla

Large language models (LLMs) achieve strong performance across many natural language processing tasks, yet their decision processes remain difficult to interpret. This lack of transparency creates challenges for trust, debugging, and deployment in real-world systems. This paper presents an applied comparative study of three explainability techniques: Integrated Gradients, Attention Rollout, and SHAP, on a fine-tuned DistilBERT model for SST-2 sentiment classification. Rather than proposing new methods, the focus is on evaluating the practical behavior of existing approaches under a consistent and reproducible setup. The results show that gradient-based attribution provides more stable and intuitive explanations, while attention-based methods are computationally efficient but less aligned with prediction-relevant features. Model-agnostic approaches offer flexibility but introduce higher computational cost and variability. This work highlights key trade-offs between explainability methods and emphasizes their role as diagnostic tools rather than definitive explanations. The findings provide practical insights for researchers and engineers working with transformer-based NLP systems. This is a preprint and has not undergone peer review.
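
Of the three techniques compared above, Integrated Gradients is the easiest to show in a few lines. The sketch below applies it to a hypothetical three-feature logistic model rather than to DistilBERT, and approximates the path integral with a midpoint Riemann sum. The model, its weights, and the all-zeros baseline are assumptions made for illustration only.

```python
import math


def model(x, w, b):
    """Toy differentiable model: logistic regression output in (0, 1)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def model_grad(x, w, b):
    """Analytic gradient of the logistic output with respect to the inputs."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate IG along the straight line from baseline to x."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [bl + alpha * (xi - bl) for xi, bl in zip(x, baseline)]
        grad = model_grad(point, w, b)
        total = [t + g for t, g in zip(total, grad)]
    # Scale the averaged gradient by the input difference, per feature.
    return [(xi - bl) * t / steps for xi, bl, t in zip(x, baseline, total)]


# Hypothetical weights and input.
w, b = [2.0, -1.0, 0.5], 0.1
x, baseline = [1.0, 2.0, -1.0], [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, w, b)
gap = model(x, w, b) - model(baseline, w, b)
print(attributions, sum(attributions), gap)
```

The attributions sum (up to discretization error) to the difference between the model output at the input and at the baseline, which is the completeness property that makes the method convenient as a diagnostic.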

Med-CAM: Minimal Evidence for Explaining Medical Decision Making

2604.13695v1 by Pirzada Suhail, Aditya Anand, Amit Sethi

Reliable and interpretable decision-making is essential in medical imaging, where diagnostic outcomes directly influence patient care. Despite advances in deep learning, most medical AI systems operate as opaque black boxes, providing little insight into why a particular diagnosis was reached. In this paper, we introduce Med-CAM, a framework for generating minimal and sharp maps as evidence-based explanations for Medical decision making via Classifier Activation Matching. Med-CAM trains a segmentation network from scratch to produce a mask that highlights the minimal evidence critical to the model's decision for any seen or unseen image. This ensures that the explanation is both faithful to the network's behaviour and interpretable to clinicians. Prior spatial explanation methods, such as Grad-CAM and attention maps, yield only fuzzy regions of relative importance. Experiments show that Med-CAM, with its superior spatial awareness of shapes, textures, and boundaries, instead delivers conclusive, evidence-based explanations that faithfully replicate the model's prediction for any given image. By explicitly constraining explanations to be compact, consistent with model activations, and diagnostically aligned, Med-CAM advances transparent AI to foster clinician understanding and trust in high-stakes medical applications such as pathology and radiology.

Learning from Change: Predictive Models for Incident Prevention in a Regulated IT Environment

2604.13462v1 by Eileen Kapel, Jan Lennartz, Luis Cruz, Diomidis Spinellis, Arie van Deursen

Effective IT change management is important for businesses that depend on software and services, particularly in highly regulated sectors such as finance, where operational reliability, auditability, and explainability are essential. A significant portion of IT incidents are caused by changes, making it important to identify high-risk changes before deployment. This study presents a predictive incident risk scoring approach at a large international bank. The approach supports engineers during the assessment and planning phases of change deployments by predicting the potential of inducing incidents. To satisfy regulatory constraints, we built the model with auditability and explainability in mind, applying SHAP values to provide feature-level insights and ensure decisions are traceable and transparent. Using a one-year real-world dataset, we compare the existing rule-based process with three machine learning models: HGBC, LightGBM, and XGBoost. LightGBM achieved the best performance, particularly when enriched with aggregated team metrics that capture organisational context. Our results show that data-driven, interpretable models can outperform rule-based approaches while meeting compliance needs, enabling proactive risk mitigation and more reliable IT operations.

Interpretable and Explainable Surrogate Modeling for Simulations: A State-of-the-Art Survey and Perspectives on Explainable AI for Decision-Making

2604.14240v1 by Pramudita Satria Palar, Paul Saves, Muhammad Daffa Robani, Nicolas Verstaevel, Moncef Garouani, Julien Aligon, Koji Shimoyama, Joseph Morlier, Benoit Gaudou

The simulation of complex systems increasingly relies on sophisticated but fundamentally opaque computational black-box simulators. Surrogate models play a central role in reducing the computational cost of complex systems simulations across a wide range of scientific and engineering domains. Notwithstanding, they inevitably inherit and often exacerbate this black-box nature, obscuring how input variables drive physical responses. Conversely, Explainable Artificial Intelligence (XAI) offers powerful tools to unpack these models. Yet, XAI methods struggle with engineering-specific constraints, such as highly correlated inputs, dynamical systems, and rigorous reliability requirements. Consequently, surrogate modeling and XAI have largely evolved as distinct fields of research, despite their strong complementarity. To reconnect these approaches, this state-of-the-art survey provides a structured perspective that maps existing XAI techniques onto the various stages of surrogate modeling workflows for design and exploration. To ground this synthesis, we draw upon illustrative applications across both equation-based simulations and agent-based modeling. We survey a broad spectrum of techniques, highlighting their strengths for revealing interactions and supporting human comprehension. Finally, we identify pressing open challenges, including the explainability of dynamical systems and the handling of mixed-variable systems, and propose a research agenda to make explainability a core, embedded element of simulation-driven workflows from model construction through decision-making. By transforming opaque emulators into explainable tools, this agenda empowers practitioners to move beyond accelerating simulations to extracting actionable insights from complex system behaviors.

ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold

2604.13392v1 by Chenlang Yi, Gang Li, Zizhan Xiong, Tue Minh Cao, Yanmin Gong, My T. Thai, Tianbao Yang

Tabular data remains prevalent in high-stakes domains such as healthcare and finance, where predictive models are expected to provide both high accuracy and faithful, human-understandable reasoning. While symbolic models offer verifiable logic, they lack semantic expressiveness. Meanwhile, general-purpose LLMs often require specialized fine-tuning to master domain-specific tabular reasoning. To address the dual challenges of scalable data curation and reasoning consistency, we propose ReSS, a systematic framework that bridges symbolic and neural reasoning models. ReSS leverages a decision-tree model to extract instance-level decision paths as symbolic scaffolds. These scaffolds, alongside input features and labels, guide an LLM to generate grounded natural-language reasoning that strictly adheres to the underlying decision logic. The resulting high-quality dataset is used to fine-tune a pretrained LLM into a specialized tabular reasoning model, further enhanced by a scaffold-invariant data augmentation strategy to improve generalization and explainability. To rigorously assess faithfulness, we introduce quantitative metrics including hallucination rate, explanation necessity, and explanation sufficiency. Experimental results on medical and financial benchmarks demonstrate that ReSS-trained models improve on traditional decision trees and standard fine-tuning approaches by up to 10% while producing faithful and consistent reasoning.
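
To make the idea of an instance-level decision path used as a symbolic scaffold concrete, here is a minimal sketch that walks one sample through a hand-written toy tree and renders the visited conditions as text that could be placed in an LLM prompt. The tree, the feature names, and the rendering format are hypothetical; the abstract does not specify how ReSS encodes its scaffolds.

```python
def decision_path(tree, sample):
    """Follow one sample from the root to a leaf, recording each test."""
    path = []
    node = tree
    while "label" not in node:  # internal nodes carry a feature/threshold test
        feature, threshold = node["feature"], node["threshold"]
        value = sample[feature]
        if value <= threshold:
            path.append((feature, "<=", threshold, value))
            node = node["left"]
        else:
            path.append((feature, ">", threshold, value))
            node = node["right"]
    return path, node["label"]


def scaffold_text(path, label):
    """Render the path as a single IF ... THEN rule for use in a prompt."""
    conditions = [f"{f} = {v} ({op} {t})" for f, op, t, v in path]
    return "IF " + " AND ".join(conditions) + f" THEN predict {label}"


# Hypothetical toy tree; thresholds are illustrative, not clinical guidance.
toy_tree = {
    "feature": "glucose",
    "threshold": 126,
    "left": {"label": "low risk"},
    "right": {
        "feature": "bmi",
        "threshold": 30,
        "left": {"label": "moderate risk"},
        "right": {"label": "high risk"},
    },
}

sample = {"glucose": 140, "bmi": 27}
path, label = decision_path(toy_tree, sample)
scaffold = scaffold_text(path, label)
print(scaffold)
```

Because the scaffold lists exactly the tests the tree applied, any generated explanation can be checked against it, which is one plausible way to quantify hallucinated or missing conditions.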

Young people's perceptions and recommendations for conversational generative artificial intelligence in youth mental health

2604.13381v1 by Adam Poulsen, Ian B. Hickie, Carla Gorban, Zsofi de Haan, William Capon, Ebenezer Eyeson-Annan, Jalal Radwan, Elizabeth M. Scott, Frank Iorfino, Haley M. LaMonica

Conversational generative artificial intelligence agents (or genAI chatbots) could benefit youth mental health, yet young people's perspectives remain underexplored. We examined the Mental health Intelligence Agent (Mia), a genAI chatbot originally designed for professionals in Australian youth services. Following co-design, 32 young people participated in online workshops to explore their perceptions of genAI chatbots in youth mental health and to develop recommendations for reconceptualising Mia for consumers and integrating it into services. Four themes were developed: (1) Humanising AI without dehumanising care, (2) I need to know what's under the hood, (3) Right tool, right place, right time?, and (4) Making it mine on safe ground. This study offers insights into young people's attitudes, needs, and requirements regarding genAI chatbots in youth mental health, with key implications for service integration. Additionally, by co-designing system requirements, this work informs the ethics, design, development, implementation, and governance of genAI chatbots in youth mental health contexts.

Explainable Fall Detection for Elderly Care via Temporally Stable SHAP in Skeleton-Based Human Activity Recognition

2604.13279v1 by Mohammad Saleh, Azadeh Tabatabaei

Fall detection in elderly care requires not only accurate classification but also reliable explanations that clinicians can trust. However, existing post-hoc explainability methods, when applied frame-by-frame to sequential data, produce temporally unstable attribution maps that clinicians cannot reliably act upon. To address this issue, we propose a lightweight and explainable framework for skeleton-based fall detection that combines an efficient LSTM model with T-SHAP, a temporally aware post-hoc aggregation strategy that stabilizes SHAP-based feature attributions over contiguous time windows. Unlike standard SHAP, which treats each frame independently, T-SHAP applies a linear smoothing operator to the attribution sequence, reducing high-frequency variance while preserving the theoretical guarantees of Shapley values, including local accuracy and consistency. Experiments on the NTU RGB+D Dataset demonstrate that the proposed framework achieves 94.3% classification accuracy with an end-to-end inference latency below 25 milliseconds, satisfying real-time constraints on mid-range hardware and indicating strong potential for deployment in clinical monitoring scenarios. Quantitative evaluation using perturbation-based faithfulness metrics shows that T-SHAP improves explanation reliability compared to standard SHAP (AUP: 0.89 vs. 0.91) and Grad-CAM (0.82), with consistent improvements observed across five-fold cross-validation. The resulting attributions consistently highlight biomechanically relevant motion patterns, including lower-limb instability and changes in spinal alignment, aligning with established clinical observations of fall dynamics and supporting their use as transparent decision aids in long-term care environments.
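
The abstract describes T-SHAP only as a linear smoothing operator applied to the attribution sequence. One simple operator of that kind is a centered moving average over neighbouring frames, sketched below on made-up per-frame attributions. The window size, the toy values, and both function names are assumptions; the paper's actual operator may differ.

```python
def smooth_attributions(attributions, window=3):
    """Centered moving average over time, applied to each feature separately.

    attributions: list of frames, each a list of per-feature attribution values.
    Near the sequence edges the window is truncated to the available frames.
    """
    n_frames = len(attributions)
    half = window // 2
    smoothed = []
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)
        frames = attributions[lo:hi]
        smoothed.append([sum(col) / len(frames) for col in zip(*frames)])
    return smoothed


def roughness(attributions):
    """Total absolute frame-to-frame change, a crude instability measure."""
    return sum(
        abs(b - a)
        for prev, cur in zip(attributions, attributions[1:])
        for a, b in zip(prev, cur)
    )


# Hypothetical per-frame attributions for two skeleton features.
raw = [[0.1, 0.5], [0.9, 0.4], [0.2, 0.6], [0.8, 0.5], [0.1, 0.4]]
smoothed = smooth_attributions(raw, window=3)
print(roughness(raw), roughness(smoothed))
```

Because the operator is linear, the smoothed attributions at each frame still sum to the same smoothing applied to the per-frame totals, so additivity holds with respect to the smoothed model output; this is one way to read the claim that the Shapley guarantees are preserved.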

Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs

2604.13258v1 by Vishal Pramanik, Maisha Maliha, Nathaniel D. Bastian, Sumit Kumar Jha

Attribution methods seek to explain language model predictions by quantifying the contribution of input tokens to generated outputs. However, most existing techniques are designed for encoder-based architectures and rely on linear approximations that fail to capture the causal and semantic complexities of autoregressive generation in decoder-only models. To address these limitations, we propose Hessian-Enhanced Token Attribution (HETA), a novel attribution framework tailored for decoder-only language models. HETA combines three complementary components: a semantic transition vector that captures token-to-token influence across layers, Hessian-based sensitivity scores that model second-order effects, and KL divergence to measure information loss when tokens are masked. This unified design produces context-aware, causally faithful, and semantically grounded attributions. Additionally, we introduce a curated benchmark dataset for systematically evaluating attribution quality in generative settings. Empirical evaluations across multiple models and datasets demonstrate that HETA consistently outperforms existing methods in attribution faithfulness and alignment with human annotations, establishing a new standard for interpretability in autoregressive language models.
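
The third component named above, KL divergence under token masking, can be sketched in isolation. The example below uses a hypothetical stand-in for a language model, in which each context token adds a fixed vector of logits over a three-word vocabulary, and scores each token by how far the next-token distribution moves when that token is replaced with a mask. It does not reproduce HETA's semantic transition vector or Hessian terms, and all names and numbers are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical toy "model": every known token contributes fixed logits.
TOKEN_LOGITS = {
    "the": [0.0, 0.0, 0.0],
    "patient": [0.2, 0.1, 0.0],
    "fever": [2.0, -1.0, 0.0],
}


def next_token_dist(tokens):
    """Sum per-token logits; unknown tokens such as [MASK] contribute zero."""
    logits = [0.0, 0.0, 0.0]
    for tok in tokens:
        contribution = TOKEN_LOGITS.get(tok, [0.0, 0.0, 0.0])
        logits = [a + b for a, b in zip(logits, contribution)]
    return softmax(logits)


def masking_scores(tokens):
    """Information lost, per position, when that token is masked out."""
    full = next_token_dist(tokens)
    scores = []
    for i in range(len(tokens)):
        masked = tokens[:i] + ["[MASK]"] + tokens[i + 1:]
        scores.append(kl_divergence(full, next_token_dist(masked)))
    return scores


context = ["the", "patient", "fever"]
scores = masking_scores(context)
print(dict(zip(context, scores)))
```

In this toy setup the uninformative token scores zero and the token with the largest logit contribution scores highest, which is the qualitative behaviour a masking-based information-loss term is meant to capture.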

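Summary: Attribution methods aim to explain language model predictions by quantifying the contribution of input tokens to generated outputs. However, most existing techniques are designed for encoder-based architectures and rely on linear approximations that cannot capture the causal and semantic complexity of autoregressive generation in decoder-only models. To address these limitations, we propose Hessian-Enhanced Token Attribution (HETA), a novel attribution framework tailored to decoder-only language models. HETA combines three complementary components: a semantic transition vector that captures token-to-token influence across layers, Hessian-based sensitivity scores that model second-order effects, and KL divergence to measure the information lost when tokens are masked. This unified design yields context-aware, causally faithful, and semantically grounded attributions. In addition, we introduce a curated benchmark dataset for systematically evaluating attribution quality in generative settings. Empirical evaluations across multiple models and datasets show that HETA consistently outperforms existing methods in attribution faithfulness and alignment with human annotations, establishing a new standard for interpretability in autoregressive language models.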
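One of the three components named above is a KL divergence that measures information loss when a token is masked. The sketch below illustrates only that masking idea in isolation, under stated assumptions: `next_token_dist` is a hypothetical callable returning a next-token probability list for a token sequence, `"[MASK]"` is a placeholder mask symbol, and nothing here reproduces HETA's semantic transition vector or Hessian terms.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def masking_scores(tokens, next_token_dist, mask_token="[MASK]"):
    """Score each token by how much masking it shifts the output distribution.

    next_token_dist is a hypothetical stand-in for a model call; it must
    return a probability list of fixed length for any token sequence.
    """
    full = next_token_dist(tokens)
    scores = []
    for i in range(len(tokens)):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        scores.append(kl_divergence(full, next_token_dist(masked)))
    return scores


def toy_dist(tokens):
    """Toy two-outcome 'model' used only to exercise the functions above."""
    return [0.9, 0.1] if "not" in tokens else [0.2, 0.8]
```

With the toy model, only masking `"not"` changes the distribution, so it is the only token that receives a positive score.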

Explainable Graph Neural Networks for Interbank Contagion Surveillance: A Regulatory-Aligned Framework for the U.S. Banking Sector

2604.14232v1 by Mohammad Nasir Uddin

The Spatial-Temporal Graph Attention Network (ST-GAT) framework was created to serve as an explainable GNN-based solution for detecting bank distress early warning signs and for conducting macro-prudential surveillance of the interbank system in the United States. The ST-GAT framework models 8,103 FDIC insured institutions across 58 quarterly snapshots (2010Q1-2024Q2). Bilateral exposures were reconstructed from publicly available FDIC Call Reports using maximum entropy estimation to produce a dynamic directed weighted graph. The framework achieves the highest AUPRC among all GNN architectures (0.939 +/- 0.010), trailing only XGBoost (0.944). Ablation analysis confirms the BiLSTM temporal component contributes +0.020 AUPRC; temporal attention weights exhibit a monotonically decreasing pattern consistent with long-run structural vulnerability weighting. Permutation importance identifies ROA (0.309) and NPL Ratio (0.252) as dominant predictors, consistent with post-mortem analyses of the 2023 regional banking crisis. All data are publicly available FDIC Call Reports and FRED series; all code and results are released.

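Summary: The Spatial-Temporal Graph Attention Network (ST-GAT) framework was created as an explainable GNN-based solution for detecting early warning signs of bank distress and for conducting macro-prudential surveillance of the interbank system in the United States. The ST-GAT framework models 8,103 FDIC-insured institutions across 58 quarterly snapshots (2010Q1-2024Q2). Bilateral exposures were reconstructed from publicly available FDIC Call Reports using maximum entropy estimation to produce a dynamic directed weighted graph. The framework achieves the highest AUPRC among all GNN architectures (0.939 +/- 0.010), trailing only XGBoost (0.944). Ablation analysis confirms that the BiLSTM temporal component contributes +0.020 AUPRC; temporal attention weights exhibit a monotonically decreasing pattern consistent with long-run structural vulnerability weighting. Permutation importance identifies ROA (0.309) and NPL Ratio (0.252) as dominant predictors, consistent with post-mortem analyses of the 2023 regional banking crisis. All data are publicly available FDIC Call Reports and FRED series; all code and results are released.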
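The abstract states that bilateral exposures were reconstructed with maximum entropy estimation but does not give the procedure. A common way to do this is to start from an independence prior with a zero diagonal and rescale rows and columns until they match each bank's total interbank assets and liabilities (iterative proportional fitting, also called RAS). The sketch below assumes that approach; the function name, inputs, and toy figures are hypothetical and are not taken from the paper.

```python
def max_entropy_exposures(assets, liabilities, iters=500, tol=1e-10):
    """Estimate a bilateral exposure matrix from marginal totals.

    assets[i]      : total interbank lending of bank i (row sum target)
    liabilities[j] : total interbank borrowing of bank j (column sum target)
    Assumes sum(assets) == sum(liabilities) and that no bank lends to itself.
    """
    n = len(assets)
    # Independence prior x_ij proportional to a_i * l_j, zero on the diagonal.
    x = [
        [0.0 if i == j else assets[i] * liabilities[j] for j in range(n)]
        for i in range(n)
    ]
    for _ in range(iters):
        # Rescale each row to match the lender's total.
        for i in range(n):
            row_sum = sum(x[i])
            if row_sum > 0:
                factor = assets[i] / row_sum
                x[i] = [v * factor for v in x[i]]
        # Rescale each column to match the borrower's total.
        for j in range(n):
            col_sum = sum(x[i][j] for i in range(n))
            if col_sum > 0:
                factor = liabilities[j] / col_sum
                for i in range(n):
                    x[i][j] *= factor
        # Columns are exact after the last step; check the rows.
        if max(abs(sum(x[i]) - assets[i]) for i in range(n)) < tol:
            break
    return x
```

The result spreads exposures as evenly as the marginals allow, which tends to understate concentration compared with real interbank networks.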